Stanford Vision and Learning Lab (@stanfordsvl)'s Twitter Profile
Stanford Vision and Learning Lab

@stanfordsvl

SVL is led by @drfeifei @silviocinguetta @jcniebles @jiajunwu_cs and works on machine learning, computer vision, robotics and language

ID: 2832190776

Website: http://svl.stanford.edu · Joined: 25-09-2014 20:16:40

356 Tweets

15.15K Followers

150 Following

Joy Hsu (@joycjhsu)

What makes a maze look like a maze? Humans can reason about infinitely many instantiations of mazes—made of candy canes, sticks, icing, yarn, etc. But VLMs often struggle to make sense of such visual abstractions. We improve VLMs' ability to interpret these abstract concepts.

Hong-Xing "Koven" Yu (@koven_yu)

🔥Spatial intelligence needs fast, *interactive* 3D world generation 🎮 — introducing WonderWorld: generating 3D scenes interactively following your movement and content requests, and seeing them in <10 seconds! 🧵1/6 Web: kovenyu.com/WonderWorld/ arXiv: arxiv.org/pdf/2406.09394

Joy Hsu (@joycjhsu)

We're organizing the first Workshop on Visual Concepts at European Conference on Computer Vision #ECCV2024 with an incredible lineup of speakers! Join us on Sep 30 afternoon at Suite 3, MiCo Milano 🇮🇹 sites.google.com/cs.stanford.ed…

Fan-Yun Sun (@sunfanyun)

Training RL/robot policies requires extensive experience in the target environment, which is often difficult to obtain. How can we "distill" embodied policies from foundation models? Introducing FactorSim! #NeurIPS2024 We show that by generating prompt-aligned simulations and…

Tianyuan Dai (@rogerdai1217)

Why hand-engineer digital twins when digital cousins are free? Check out ACDC: Automated Creation of Digital Cousins 👭 for Robust Policy Learning, accepted at @corl2024! 🎉
📸 Single image -> 🏡 Interactive scene
⏩ Fully automatic (no annotations needed!)
🦾 Robot policies…

Manling Li (@manlingli_)

[NeurIPS D&B Oral] Embodied Agent Interface: Benchmarking LLMs for Embodied Agents. A single line of code to evaluate your model!
🌟 Standardize Goal Specifications: LTL
🌟 Standardize Modules and Interfaces: 4 modules, 438 tasks, 1475 goals
🌟 Standardize Fine-grained Metrics: 18…
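
As a rough illustration of why temporal logic makes goal specifications machine-checkable, here is a minimal sketch of evaluating simple LTL-style goals against a recorded state trajectory. The names (prop, eventually, always) and data structures are hypothetical and are not the benchmark's actual API.

```python
# Illustrative only: "eventually"/"always"-style LTL checks over a state trajectory.
from typing import Callable, Sequence

State = dict  # e.g. {"on(cup, table)": True, "open(fridge)": False}
Prop = Callable[[State], bool]

def prop(name: str) -> Prop:
    """Atomic proposition: true iff the named predicate holds in a state."""
    return lambda s: bool(s.get(name, False))

def eventually(p: Prop, trajectory: Sequence[State]) -> bool:
    """F p: p holds in at least one state along the trajectory."""
    return any(p(s) for s in trajectory)

def always(p: Prop, trajectory: Sequence[State]) -> bool:
    """G p: p holds in every state along the trajectory."""
    return all(p(s) for s in trajectory)

# Hypothetical goal: "eventually the cup is on the table, and the fridge stays closed".
trajectory = [
    {"on(cup, table)": False, "open(fridge)": False},
    {"on(cup, table)": True,  "open(fridge)": False},
]
satisfied = (
    eventually(prop("on(cup, table)"), trajectory)
    and always(lambda s: not s.get("open(fridge)", False), trajectory)
)
print(satisfied)  # True
```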

Yunong Liu (@yunongliu1)

💫🪑Introducing IKEA Manuals at Work: The first multimodal dataset with extensive 4D groundings of assembly in internet videos! We track furniture parts’ 6-DoF poses and segmentation masks through the assembly process, revealing how parts connect in both 2D and 3D space. With…

Keshigeyan Chandrasegaran (@keshigeyan)

1/ [NeurIPS D&B] Introducing HourVideo: A benchmark for hour-long video-language understanding! 🚀
500 egocentric videos, 18 total tasks & ~13k questions!
Performance: GPT-4 ➡️ 25.7%, Gemini 1.5 Pro ➡️ 37.3%, Humans ➡️ 85.0%
We highlight a significant gap in multimodal capabilities 🧵👇

Hong-Xing "Koven" Yu (@koven_yu)

🤩Forget MoCap -- Let’s generate human interaction motions with *Real-world 3D scenes*!🏃🏞️ Introducing ZeroHSI: Zero-Shot 4D Human-Scene Interaction by Video Generation. No training, No MoCap data! 🧵1/5 Web: awfuact.github.io/zerohsi/

Joy Hsu (@joycjhsu)

Excited to bring back the 2nd Workshop on Visual Concepts at #CVPR2025, this time with a call for papers! We welcome submissions on the following topics. See our website for more info: sites.google.com/stanford.edu/w… Join us & a fantastic lineup of speakers in Tennessee!

Yunfan Jiang (@yunfanjiang)

🚀Two weeks ago, we hosted a welcome party for the newest member of our Stanford Vision and Learning Lab—a new robot! 🤖✨Watch as Fei-Fei Li interacts with it in this fun video. Exciting release coming soon. Stay tuned! 👀🎉

Yunfan Jiang (@yunfanjiang)

🤖 Ever wondered what robots need to truly help humans around the house? 🏡 Introducing 𝗕𝗘𝗛𝗔𝗩𝗜𝗢𝗥 𝗥𝗼𝗯𝗼𝘁 𝗦𝘂𝗶𝘁𝗲 (𝗕𝗥𝗦)—a comprehensive framework for mastering mobile whole-body manipulation across diverse household tasks! 🧹🫧 From taking out the trash to…

Fan-Yun Sun (@sunfanyun)

Spatial reasoning is a major challenge for foundation models today, even in simple tasks like arranging objects in 3D space. #CVPR2025 Introducing LayoutVLM, a differentiable optimization framework that uses VLMs to spatially reason about diverse scene layouts from unlabeled…
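
To make the idea of "differentiable optimization over layouts" concrete, here is a generic sketch, not LayoutVLM's actual objective: object positions are free parameters, and soft spatial constraints of the kind a VLM might propose are written as hinge losses and minimized by gradient descent. The scene, constraints, and margins are made up for illustration.

```python
import torch

# Hypothetical scene: the chair should sit left of the desk, and the lamp
# should keep at least 1.0 unit of clearance from the desk.
positions = torch.nn.Parameter(torch.randn(3, 2))  # (chair, desk, lamp) in 2D
CHAIR, DESK, LAMP = 0, 1, 2
opt = torch.optim.Adam([positions], lr=0.05)

for step in range(500):
    opt.zero_grad()
    # Hinge-style penalties: zero once the constraint is satisfied with margin.
    left_of = torch.relu(positions[CHAIR, 0] - positions[DESK, 0] + 0.5)
    clearance = torch.relu(1.0 - torch.norm(positions[LAMP] - positions[DESK]))
    loss = left_of + clearance
    loss.backward()
    opt.step()

print(positions.detach())  # layout that (approximately) satisfies the constraints
```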

Hong-Xing "Koven" Yu (@koven_yu)

🔥Want to capture 3D dancing fluids♨️🌫️🌪️💦? No specialized equipment, just one video! Introducing FluidNexus: Now you only need one camera to reconstruct 3D fluid dynamics and predict future evolution! 🧵1/4 Web: yuegao.me/FluidNexus/ Arxiv: arxiv.org/pdf/2503.04720

Hong-Xing "Koven" Yu (@koven_yu)

🔥Spatial intelligence requires world generation, and now we have the first comprehensive evaluation benchmark📏 for it! Introducing WorldScore: Unifying evaluation for 3D, 4D, and video models on world generation! 🧵1/7 Web: haoyi-duan.github.io/WorldScore/ arxiv: arxiv.org/abs/2504.00983

Emily Jin (@emilyzjin)

State classification of objects and their relations (e.g., the cup is next to the plate) is core to many tasks like robot planning and manipulation. But dynamic real-world environments often require models to generalize to novel predicates from few examples. We present PHIER, a…
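
For readers unfamiliar with the few-shot setup, here is a generic prototype-based sketch, explicitly not PHIER's method: whether a novel predicate holds is predicted by comparing a query embedding against prototypes built from a handful of labeled support examples. The embeddings here are random stand-ins.

```python
import numpy as np

def prototype(embeddings: np.ndarray) -> np.ndarray:
    """Mean embedding of the support examples for one class."""
    return embeddings.mean(axis=0)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def predicate_holds(query: np.ndarray, pos: np.ndarray, neg: np.ndarray) -> bool:
    """Predict True if the query is closer to the positive prototype."""
    return cosine(query, prototype(pos)) > cosine(query, prototype(neg))

# Made-up embeddings standing in for scene features of a novel predicate.
rng = np.random.default_rng(0)
support_pos = rng.normal(1.0, 0.1, size=(5, 16))   # a few "next to" examples
support_neg = rng.normal(-1.0, 0.1, size=(5, 16))  # a few "not next to" examples
query = rng.normal(1.0, 0.1, size=16)
print(predicate_holds(query, support_pos, support_neg))  # True
```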

Joy Hsu (@joycjhsu)

We'll be presenting Deep Schema Grounding at ICLR 2025 🇸🇬 on Thursday (session 1 #98). Come chat about abstract visual concepts, structured decomposition, & what makes a maze a maze! & test your models on our challenging Visual Abstractions Benchmark: stanford.edu/~joycj/project…

Keshigeyan Chandrasegaran (@keshigeyan)

1/ Model architectures have been mostly treated as fixed post-training. 🌱 Introducing Grafting: A new way to edit pretrained diffusion transformers, allowing us to customize architectural designs on a small compute budget. 🌎 grafting.stanford.edu Co-led with Michael Poli
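
As a rough sketch of what editing a pretrained architecture on a small compute budget can look like in general (a generic illustration, not the Grafting procedure), one can swap a single block of a frozen model for a new operator and train only the replacement. The block definitions and objective below are placeholders.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Stand-in for a pretrained transformer block (attention + MLP)."""
    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
    def forward(self, x):
        x = x + self.attn(x, x, x, need_weights=False)[0]
        return x + self.mlp(x)

dim, depth = 64, 6
pretrained = nn.Sequential(*[Block(dim) for _ in range(depth)])  # stand-in for a pretrained model

class DepthwiseConvBlock(nn.Module):
    """A cheaper replacement operator grafted in place of one attention block."""
    def __init__(self, dim: int):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
    def forward(self, x):                      # x: (batch, tokens, dim)
        return x + self.conv(x.transpose(1, 2)).transpose(1, 2)

pretrained[3] = DepthwiseConvBlock(dim)        # edit the architecture in place
for p in pretrained.parameters():
    p.requires_grad = False                    # freeze the pretrained weights
for p in pretrained[3].parameters():
    p.requires_grad = True                     # train only the new block

trainable = [p for p in pretrained.parameters() if p.requires_grad]
opt = torch.optim.AdamW(trainable, lr=1e-4)
x = torch.randn(2, 16, dim)
loss = pretrained(x).pow(2).mean()             # placeholder training objective
loss.backward()
opt.step()
```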

Yunzhi Zhang (@zhang_yunzhi)

(1/n) Time to unify your favorite visual generative models, VLMs, and simulators for controllable visual generation—Introducing a Product of Experts (PoE) framework for inference-time knowledge composition from heterogeneous models.
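
The product-of-experts idea itself is easy to state: the joint score of a candidate is the product of per-expert probabilities, which is the sum of their log-scores. Below is a toy, self-contained illustration with made-up experts (a realism prior, a VLM-style constraint checker, and a simulator-style physics check); it is not the paper's actual framework.

```python
import random

def expert_realism(candidate: dict) -> float:
    """Stand-in for a generative prior: prefers moderate object counts."""
    return -abs(candidate["num_objects"] - 3.0)

def expert_vlm(candidate: dict) -> float:
    """Stand-in for a VLM checking a text constraint ('exactly two chairs')."""
    return 0.0 if candidate["num_chairs"] == 2 else -5.0

def expert_simulator(candidate: dict) -> float:
    """Stand-in for a physics check: heavily penalize interpenetration."""
    return -10.0 if candidate["has_collisions"] else 0.0

experts = [expert_realism, expert_vlm, expert_simulator]

def joint_log_score(candidate: dict) -> float:
    # log of a product of expert probabilities = sum of expert log-scores.
    return sum(e(candidate) for e in experts)

# Sample candidate scene descriptions; keep the one all experts jointly prefer.
random.seed(0)
candidates = [
    {"num_objects": random.randint(1, 6),
     "num_chairs": random.randint(0, 3),
     "has_collisions": random.random() < 0.3}
    for _ in range(32)
]
best = max(candidates, key=joint_log_score)
print(best, joint_log_score(best))
```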

Hong-Xing "Koven" Yu (@koven_yu)

#ICCV2025 🤩3D world generation is cool, but it is cooler to play with the worlds using 3D actions 👆💨, and see what happens! — Introducing *WonderPlay*: Now you can create dynamic 3D scenes that respond to your 3D actions from a single image! Web: kyleleey.github.io/WonderPlay/ 🧵1/7