Roland Memisevic (@rolandmemisevic)'s Twitter Profile
Roland Memisevic

@rolandmemisevic

PhD, U Toronto 2008 (advisor Geoff Hinton)
Faculty @ MILA (until 2016)
Co-founder/CEO Twenty Billion Neurons (acquired 2021)
Qualcomm AI Research since 2021

ID: 843157972356403200

Link: https://www.iro.umontreal.ca/~memisevr/
Joined: 18-03-2017 17:51:40

154 Tweets

401 Followers

376 Following

Apratim Bhattacharyya (@apratimbh)'s Twitter Profile Photo

Qualcomm AI Research is looking for interns (PhD/Master's) in Toronto, in the area of LLMs, multi-modality and agents. Job posting: tinyurl.com/3tyavey5

Apratim Bhattacharyya (@apratimbh)'s Twitter Profile Photo

Excited to present our #ICLR2024 paper “Look, Remember and Reason: Grounded Reasoning in Videos with Language Models” (arxiv.org/pdf/2306.17778…). Our method, LRR, is currently ranked 1st on the STAR leaderboard: eval.ai/web/challenges… 1/3

Roland Memisevic (@rolandmemisevic)'s Twitter Profile Photo

A widely held belief is that difficult vision tasks require vision components like object detectors during inference. We show that we can push all of that into the training stage and get SOTA on challenging visual reasoning tasks. End-to-end visual reasoning on raw pixels.

Roland Memisevic (@rolandmemisevic)'s Twitter Profile Photo

I remember that well; I've been looking for it recently but couldn't find it. It also features Jitendra Malik, as far as I remember.

Roland Memisevic (@rolandmemisevic)'s Twitter Profile Photo

Even finicky visual reasoning tasks are best solved end-to-end, by distilling visual subroutines, like object detection, into the model during training... Check out our poster at #ICLR2024
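
To make that recipe concrete, here is a minimal PyTorch sketch of the general idea, not the actual LRR architecture: an auxiliary head is supervised with an off-the-shelf detector's outputs during training and simply dropped at inference, so the deployed model runs end-to-end on raw pixels. All names, shapes, and loss weights are illustrative assumptions.

    import torch
    import torch.nn as nn

    class PixelReasoner(nn.Module):
        # Hypothetical model: a pixel encoder feeding a small transformer,
        # with an answer head (kept at inference) and a detection head
        # (used only as a training-time distillation target).
        def __init__(self, d=256, n_objects=10, n_answers=100):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Conv2d(3, d, kernel_size=8, stride=8),   # stand-in encoder
                nn.ReLU(),
                nn.Flatten(2))                               # (B, d, tokens)
            layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
            self.reasoner = nn.TransformerEncoder(layer, num_layers=2)
            self.answer_head = nn.Linear(d, n_answers)       # inference path
            self.detect_head = nn.Linear(d, n_objects * 4)   # training-only head

        def forward(self, pixels):
            tokens = self.backbone(pixels).transpose(1, 2)   # (B, tokens, d)
            h = self.reasoner(tokens).mean(dim=1)            # pooled features
            return self.answer_head(h), self.detect_head(h)

    model = PixelReasoner()
    pixels = torch.randn(2, 3, 64, 64)                       # raw-pixel input
    answer_logits, box_preds = model(pixels)

    # Training-time distillation: match the auxiliary head to boxes produced
    # by an external object detector; at test time that head is ignored.
    answers = torch.randint(0, 100, (2,))
    teacher_boxes = torch.randn(2, 40)                       # placeholder detector outputs
    loss = (nn.functional.cross_entropy(answer_logits, answers)
            + 0.5 * nn.functional.mse_loss(box_preds, teacher_boxes))
    loss.backward()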

Roland Memisevic (@rolandmemisevic)'s Twitter Profile Photo

It is tempting to view the context window of an LLM as an "array", with elements you can access like in an addressable memory. In this work, we argue that this entrenched, but wrong, view may be at the heart of problems like the inability of LLMs to length-generalize.
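
A toy numpy contrast, purely illustrative and not from the paper, makes the distinction concrete: array indexing reads an element exactly, by address, while attention retrieves a soft, similarity-weighted blend, by content.

    import numpy as np

    # The "array" view: positions are addresses you can read exactly.
    context = np.array([3, 1, 4, 1, 5])
    print(context[2])                           # position-based access -> 4

    # What attention actually does: content-based soft retrieval. Position
    # enters only indirectly, e.g. through learned positional encodings.
    rng = np.random.default_rng(0)
    keys = rng.normal(size=(5, 8))              # one key vector per token
    values = rng.normal(size=(5, 8))            # one value vector per token
    query = keys[2] + 0.1 * rng.normal(size=8)  # noisy query aimed at token 2

    weights = np.exp(keys @ query)
    weights /= weights.sum()                    # softmax over positions
    retrieved = weights @ values                # a weighted mixture, never an exact read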

JB (@iamjbdel)'s Twitter Profile Photo

#COLM24 is live! Head over to paper-central to browse the proceedings and access each paper's 🤗 paper-page, where you can check out discussions, open-source resources, GitHub links, and OpenReview peer review conversations—all in one spot. By the way, this new conference

Apratim Bhattacharyya (@apratimbh)'s Twitter Profile Photo

🚨 Don't miss our #NeurIPS2024 (D&B track) poster "Live Fitness Coaching as a Testbed for Situated Interaction". arXiv: arxiv.org/abs/2407.08101 Dataset: qualcomm.com/developer/soft… Code: github.com/Qualcomm-AI-re… (coming shortly) 📅 Fri 13 Dec, 4:30 p.m.-7:30 p.m. PST

Apratim Bhattacharyya (@apratimbh)'s Twitter Profile Photo

Join us at the CVPR 2025 Workshop on Vision-based Assistants in the Real-world (VAR) and tackle one of AI's biggest challenges: building systems that can comprehend and reason about dynamic, real-world scenes. Workshop Page: varworkshop.github.io #CVPR2025 1/2

Roland Memisevic (@rolandmemisevic)'s Twitter Profile Photo

There is a very simple possible explanation: the transformer architecture lacks a basic inductive bias, that of inductive ("step-by-step") inference. The lack of that bias leads to insane data inefficiency. If true, humans + RNNs will learn this task easily with much less data.

Roland Memisevic (@rolandmemisevic)'s Twitter Profile Photo

A few years ago I lost a bet that by 2021 you could have a video call with someone without being able to tell if it's AI or human. It's 2025 and we're halfway there (the missing half is making the AI model see properly through the camera, which is still an unsolved problem).

Roland Memisevic (@rolandmemisevic)'s Twitter Profile Photo

No existing AI model can talk to a user in the real world and understand what's happening _right now_. This is called situated AI and it's still a wide open problem.

Roland Memisevic (@rolandmemisevic)'s Twitter Profile Photo

Binary parity ("is the number of 1s in a bit-string even or odd?") is a common task used to show that transformers cannot generalize. Turns out, a random(!) RNN (with only the readout trained) can learn the task easily, and it can do so with as few as 2 (two) training examples...: arxiv.org/pdf/2505.21749
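
A minimal numpy sketch of that setup, with state size, weight scaling, and a small handful of training strings chosen by me (the paper's actual construction, and the two-example result, are in the arXiv link): the recurrent and input weights are random and frozen, and only a linear readout on the final hidden state is fit.

    import numpy as np

    rng = np.random.default_rng(0)
    n_hidden = 256
    W_in = rng.normal(0.0, 1.0, size=n_hidden)                 # random, frozen
    W_rec = rng.normal(0.0, 1.0 / np.sqrt(n_hidden), size=(n_hidden, n_hidden))

    def final_state(bits):
        # Run the frozen random RNN over the bit-string; the readout
        # only ever sees this final hidden state.
        h = np.zeros(n_hidden)
        for b in bits:
            h = np.tanh(W_rec @ h + W_in * (2 * b - 1))
        return h

    # A handful of labeled strings (parity: is the number of 1s odd?).
    train = [rng.integers(0, 2, size=10) for _ in range(8)]
    X = np.stack([final_state(s) for s in train])
    y = 2.0 * np.array([s.sum() % 2 for s in train]) - 1.0     # labels in {-1, +1}

    # Ridge-regularized least-squares readout: the only trained parameters.
    w = np.linalg.solve(X.T @ X + 1e-3 * np.eye(n_hidden), X.T @ y)

    # Check on fresh strings, including ones longer than the training length.
    for length in (10, 20, 40):
        s = rng.integers(0, 2, size=length)
        print(length, int(s.sum() % 2), int(final_state(s) @ w > 0))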