Martin Ziqiao Ma (@ziqiao_ma) 's Twitter Profile
Martin Ziqiao Ma

@ziqiao_ma

〽️ PhD @UMichCSE | 💼 @IBM @Adobe @Amazon | @ACLMentorship | Weinberg Cogsci Fellow | Cogsci x Multimodality | 💬 Language Grounding & Alignment to 👥 & 👀.

ID: 1194045284621438980

Link: http://ziqiaoma.com/ · Joined: 12-11-2019 00:12:44

786 Tweets

2.2K Followers

1.1K Following

Hokin Deng (@denghokin) 's Twitter Profile Photo


#ICML #cognition #GrowAI We spent 2 years carefully curating every single experiment (e.g., object permanence, the A-not-B task, the visual cliff task) in this dataset (total: 1,503 classic experiments spanning 12 core cognitive concepts).

We spent another year getting 230 MLLMs evaluated.
Tanishq Mathew Abraham, Ph.D. (@iscienceluvr) 's Twitter Profile Photo


Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation

"we introduce WM-ABench, a large-scale benchmark comprising 23 fine-grained evaluation dimensions across 6 diverse simulated environments with controlled counterfactual simulations. Through 660
CLS (@chengleisi) 's Twitter Profile Photo


Are AI scientists already better than human researchers?

We recruited 43 PhD students to spend 3 months executing research ideas proposed by an LLM agent vs human experts.

Main finding: LLM ideas result in worse projects than human ideas.
Zhiting Hu (@zhitinghu) 's Twitter Profile Photo

🚨Do frontier VLMs (o3, Gemini 2.5, Claude 3.5, Qwen…) actually learn an internal world model🌍? Surprisingly, the answer appears to be a hard NO—as revealed by our WM Atomic Benchmark⚛️. Even o3 struggles with the most basic, atomic-level questions: ❌Confuse triangles📐 with

Martin Ziqiao Ma (@ziqiao_ma) 's Twitter Profile Photo

Excited to share WM-ABench, the first atomic and controlled benchmark of internal world models in VLMs, to appear in #ACL2025 Findings. I'm particularly proud of the cognitively-inspired conceptual framework that grounds our design. If you're curious about how we formalize

Martin Ziqiao Ma (@ziqiao_ma) 's Twitter Profile Photo

Just 5 days ago, I asked Xiang whether I should try using a distilled math reasoning model as the base for a VLM I’m training. He said no. I asked why. He said, “Stay tuned.” And now… here I am, reading this paper with the rest of you.

Martin Ziqiao Ma (@ziqiao_ma) 's Twitter Profile Photo


I know ACL and ICML are around the corner, but the only conference I’m planning to attend this month is #AX2025. 

But yeah, I did launch a job at the expo. 🤪
Martin Ziqiao Ma (@ziqiao_ma) 's Twitter Profile Photo

Our study on pragmatic generation is accepted to #COLM2025! Missed the first COLM last year (no suitable ongoing project at the time😅). Heard it’s a great place to connect with LM folks, excited to join for round two finally.

Eric Xing (@ericxing) 's Twitter Profile Photo

I have been long arguing that a world model is NOT about generating videos, but IS about simulating all possibilities of the world to serve as a sandbox for general-purpose reasoning via thought-experiments. This paper proposes an architecture toward that arxiv.org/abs/2507.05169

Weijia Shi (@weijiashi2) 's Twitter Profile Photo

Can data owners & LM developers collaborate to build a strong shared model while each retaining data control? Introducing FlexOlmo💪, a mixture-of-experts LM enabling: • Flexible training on your local data without sharing it • Flexible inference to opt in/out your data

Zhengzhong Tu (@_vztu) 's Twitter Profile Photo


🤨Ever dream of a tool that can magically restore and upscale any (low-res) photo to crystal-clear 4K? 

🔥Introducing "4KAgent: Agentic Any Image to 4K Super-Resolution", the most capable upscaling generalist designed to handle broad image types.
🔗4kagent.github.io
1/🧵
Martin Ziqiao Ma (@ziqiao_ma) 's Twitter Profile Photo


📣 Excited to announce SpaVLE: #NeurIPS2025 Workshop on Space in Vision, Language, and Embodied AI! 

👉 …vision-language-embodied-ai.github.io

🦾Co-organized with an incredible team → Freda Shi · Jiayuan Mao · Jiafei Duan · Manling Li · David Hsu · Parisa Kordjamshidi

🌌 Why Space & SpaVLE?
We
Jiafei Duan (@djiafei) 's Twitter Profile Photo


📣 Excited to announce SpaVLE: #NeurIPS2025 Workshop on Space in Vision, Language, and Embodied AI!

👉…vision-language-embodied-ai.github.io
Martin Ziqiao Ma (@ziqiao_ma) 's Twitter Profile Photo

+1 on this! Mixed-effects models are an underrated tool for behavioral analysis that AI researchers often overlook. Behavioral data are almost never independent: clustering, repeated measures, and hierarchical structures abound. Mixed-effects models account for these
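The non-independence point above can be made concrete with a small simulation (all numbers below are illustrative, not from the tweet): when observations cluster within groups (e.g., repeated trials per participant), the naive i.i.d. formula for the variance of the grand mean is badly overconfident, which is exactly the failure mode mixed-effects models guard against.

```python
import numpy as np

rng = np.random.default_rng(0)
G, n_per = 50, 20        # 50 groups (e.g., participants), 20 trials each
tau, sigma = 1.0, 1.0    # between-group and within-group standard deviations

def simulate_grand_mean():
    # One random intercept per group, then within-group noise per trial.
    intercepts = rng.normal(0.0, tau, size=G)
    obs = intercepts[:, None] + rng.normal(0.0, sigma, size=(G, n_per))
    return obs.mean()

# Monte Carlo estimate of how much the grand mean actually varies.
means = np.array([simulate_grand_mean() for _ in range(5000)])
empirical_var = means.var()

# Naive formula: pretends all G * n_per = 1000 observations are independent.
naive_var = (tau**2 + sigma**2) / (G * n_per)

# Cluster-aware formula: the group-level term tau^2 / G dominates.
correct_var = tau**2 / G + sigma**2 / (G * n_per)

print(f"empirical: {empirical_var:.4f}  naive: {naive_var:.4f}  "
      f"cluster-aware: {correct_var:.4f}")
```

The empirical variance tracks the cluster-aware formula and is roughly an order of magnitude larger than the naive one, so confidence intervals built on the i.i.d. assumption would be far too narrow. In practice one would fit this structure directly, e.g. with `statsmodels`' `mixedlm` in Python or `lme4` in R, rather than correcting by hand.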

Martin Ziqiao Ma (@ziqiao_ma) 's Twitter Profile Photo

Wow, lots of discussion about my paper... sorry I'm a bit late to the party. 😄 You raise an important point about prompt sensitivity. But it's crucial to recognize the asymmetry in the logical implications of positive vs. negative evidence: -> To demonstrate that a system P