Stanford Vision and Learning Lab (@stanfordsvl)'s Twitter Profile
Stanford Vision and Learning Lab

@stanfordsvl

SVL is led by @drfeifei @silviocinguetta @jcniebles @jiajunwu_cs and works on machine learning, computer vision, robotics and language

ID: 2832190776

Website: http://svl.stanford.edu · Joined: 25-09-2014 20:16:40

356 Tweets

15.15K Followers

150 Following

Joy Hsu (@joycjhsu)

What makes a maze look like a maze? Humans can reason about infinitely many instantiations of mazes—made of candy canes, sticks, icing, yarn, etc. But VLMs often struggle to make sense of such visual abstractions. We improve VLMs' ability to interpret these abstract concepts.

Hong-Xing "Koven" Yu (@koven_yu)

🔥Spatial intelligence needs fast, *interactive* 3D world generation 🎮 — introducing WonderWorld: generating 3D scenes interactively following your movement and content requests, and seeing them in <10 seconds! 🧵1/6 Web: kovenyu.com/WonderWorld/ arXiv: arxiv.org/pdf/2406.09394

Joy Hsu (@joycjhsu)

We're organizing the first Workshop on Visual Concepts at European Conference on Computer Vision #ECCV2024 with an incredible lineup of speakers! Join us on Sep 30 afternoon at Suite 3, MiCo Milano 🇮🇹 sites.google.com/cs.stanford.ed…

Fan-Yun Sun (@sunfanyun)

Training RL/robot policies requires extensive experience in the target environment, which is often difficult to obtain. How can we "distill" embodied policies from foundation models? Introducing FactorSim! #NeurIPS2024 We show that by generating prompt-aligned simulations and…

Tianyuan Dai (@rogerdai1217)

Why hand-engineer digital twins when digital cousins are free? Check out ACDC: Automated Creation of Digital Cousins 👭 for Robust Policy Learning, accepted at @corl2024! 🎉
📸 Single image -> 🏡 Interactive scene
⏩ Fully automatic (no annotations needed!)
🦾 Robot policies…

Manling Li (@manlingli_)

[NeurIPS D&B Oral] Embodied Agent Interface: Benchmarking LLMs for Embodied Agents. A single line of code to evaluate your model!
🌟 Standardize Goal Specifications: LTL
🌟 Standardize Modules and Interfaces: 4 modules, 438 tasks, 1475 goals
🌟 Standardize Fine-grained Metrics: 18…
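
As a rough illustration of why temporal logic makes goal specifications machine-checkable, here is a minimal sketch of evaluating simple LTL-style goals against a recorded state trajectory. The names (prop, eventually, always) and data structures are hypothetical and are not the benchmark's actual API.

```python
# Illustrative only: "eventually"/"always"-style LTL checks over a state trajectory.
from typing import Callable, Sequence

State = dict  # e.g. {"on(cup, table)": True, "open(fridge)": False}
Prop = Callable[[State], bool]

def prop(name: str) -> Prop:
    """Atomic proposition: true iff the named predicate holds in a state."""
    return lambda s: bool(s.get(name, False))

def eventually(p: Prop, trajectory: Sequence[State]) -> bool:
    """F p: p holds in at least one state along the trajectory."""
    return any(p(s) for s in trajectory)

def always(p: Prop, trajectory: Sequence[State]) -> bool:
    """G p: p holds in every state along the trajectory."""
    return all(p(s) for s in trajectory)

# Hypothetical goal: "eventually the cup is on the table, and the fridge stays closed".
trajectory = [
    {"on(cup, table)": False, "open(fridge)": False},
    {"on(cup, table)": True,  "open(fridge)": False},
]
satisfied = (
    eventually(prop("on(cup, table)"), trajectory)
    and always(lambda s: not s.get("open(fridge)", False), trajectory)
)
print(satisfied)  # True
```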

Yunong Liu (@yunongliu1)

💫🪑Introducing IKEA Manuals at Work: The first multimodal dataset with extensive 4D groundings of assembly in internet videos! We track furniture parts’ 6-DoF poses and segmentation masks through the assembly process, revealing how parts connect in both 2D and 3D space. With…

Keshigeyan Chandrasegaran (@keshigeyan)

1/ [NeurIPS D&B] Introducing HourVideo: A benchmark for hour-long video-language understanding! 🚀
500 egocentric videos, 18 total tasks & ~13k questions!
Performance: GPT-4 ➡️ 25.7%, Gemini 1.5 Pro ➡️ 37.3%, Humans ➡️ 85.0%
We highlight a significant gap in multimodal capabilities 🧵👇

Hong-Xing "Koven" Yu (@koven_yu)

🤩Forget MoCap -- Let’s generate human interaction motions with *Real-world 3D scenes*!🏃🏞️ Introducing ZeroHSI: Zero-Shot 4D Human-Scene Interaction by Video Generation. No training, No MoCap data! 🧵1/5 Web: awfuact.github.io/zerohsi/

Joy Hsu (@joycjhsu)

Excited to bring back the 2nd Workshop on Visual Concepts at #CVPR2025, this time with a call for papers! We welcome submissions on the following topics. See our website for more info: sites.google.com/stanford.edu/w… Join us & a fantastic lineup of speakers in Tennessee!

Yunfan Jiang (@yunfanjiang)

🚀Two weeks ago, we hosted a welcome party for the newest member of our Stanford Vision and Learning Lab—a new robot! 🤖✨Watch as Fei-Fei Li interacts with it in this fun video. Exciting release coming soon. Stay tuned! 👀🎉

Yunfan Jiang (@yunfanjiang)

🤖 Ever wondered what robots need to truly help humans around the house? 🏡 Introducing 𝗕𝗘𝗛𝗔𝗩𝗜𝗢𝗥 𝗥𝗼𝗯𝗼𝘁 𝗦𝘂𝗶𝘁𝗲 (𝗕𝗥𝗦)—a comprehensive framework for mastering mobile whole-body manipulation across diverse household tasks! 🧹🫧 From taking out the trash to…

Fan-Yun Sun (@sunfanyun)

Spatial reasoning is a major challenge for foundation models today, even in simple tasks like arranging objects in 3D space. #CVPR2025 Introducing LayoutVLM, a differentiable optimization framework that uses VLMs to spatially reason about diverse scene layouts from unlabeled…
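
To make the idea of "differentiable optimization over layouts" concrete, here is a generic sketch, not LayoutVLM's actual objective: object positions are free parameters, and soft spatial constraints of the kind a VLM might propose are written as hinge losses and minimized by gradient descent. The scene, constraints, and margins are made up for illustration.

```python
import torch

# Hypothetical scene: the chair should sit left of the desk, and the lamp
# should keep at least 1.0 unit of clearance from the desk.
positions = torch.nn.Parameter(torch.randn(3, 2))  # (chair, desk, lamp) in 2D
CHAIR, DESK, LAMP = 0, 1, 2
opt = torch.optim.Adam([positions], lr=0.05)

for step in range(500):
    opt.zero_grad()
    # Hinge-style penalties: zero once the constraint is satisfied with margin.
    left_of = torch.relu(positions[CHAIR, 0] - positions[DESK, 0] + 0.5)
    clearance = torch.relu(1.0 - torch.norm(positions[LAMP] - positions[DESK]))
    loss = left_of + clearance
    loss.backward()
    opt.step()

print(positions.detach())  # layout that (approximately) satisfies the constraints
```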

Hong-Xing "Koven" Yu (@koven_yu)

🔥Want to capture 3D dancing fluids♨️🌫️🌪️💦? No specialized equipment, just one video! Introducing FluidNexus: Now you only need one camera to reconstruct 3D fluid dynamics and predict future evolution! 🧵1/4 Web: yuegao.me/FluidNexus/ Arxiv: arxiv.org/pdf/2503.04720

Hong-Xing "Koven" Yu (@koven_yu)

🔥Spatial intelligence requires world generation, and now we have the first comprehensive evaluation benchmark📏 for it! Introducing WorldScore: Unifying evaluation for 3D, 4D, and video models on world generation! 🧵1/7 Web: haoyi-duan.github.io/WorldScore/ arxiv: arxiv.org/abs/2504.00983

Emily Jin (@emilyzjin)

State classification of objects and their relations (e.g., the cup is next to the plate) is core to many tasks like robot planning and manipulation. But dynamic real-world environments often require models to generalize to novel predicates from few examples. We present PHIER, a…
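
For readers unfamiliar with the few-shot setup, here is a generic prototype-based sketch, explicitly not PHIER's method: whether a novel predicate holds is predicted by comparing a query embedding against prototypes built from a handful of labeled support examples. The embeddings here are random stand-ins.

```python
import numpy as np

def prototype(embeddings: np.ndarray) -> np.ndarray:
    """Mean embedding of the support examples for one class."""
    return embeddings.mean(axis=0)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def predicate_holds(query: np.ndarray, pos: np.ndarray, neg: np.ndarray) -> bool:
    """Predict True if the query is closer to the positive prototype."""
    return cosine(query, prototype(pos)) > cosine(query, prototype(neg))

# Made-up embeddings standing in for scene features of a novel predicate.
rng = np.random.default_rng(0)
support_pos = rng.normal(1.0, 0.1, size=(5, 16))   # a few "next to" examples
support_neg = rng.normal(-1.0, 0.1, size=(5, 16))  # a few "not next to" examples
query = rng.normal(1.0, 0.1, size=16)
print(predicate_holds(query, support_pos, support_neg))  # True
```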

Joy Hsu (@joycjhsu)

We'll be presenting Deep Schema Grounding at ICLR 2025 🇸🇬 on Thursday (session 1 #98). Come chat about abstract visual concepts, structured decomposition, & what makes a maze a maze! & test your models on our challenging Visual Abstractions Benchmark: stanford.edu/~joycj/project…

Keshigeyan Chandrasegaran (@keshigeyan)

1/ Model architectures have been mostly treated as fixed post-training. 🌱 Introducing Grafting: A new way to edit pretrained diffusion transformers, allowing us to customize architectural designs on a small compute budget. 🌎 grafting.stanford.edu Co-led with Michael Poli
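
As a rough sketch of what editing a pretrained architecture on a small compute budget can look like in general (a generic illustration, not the Grafting procedure), one can swap a single block of a frozen model for a new operator and train only the replacement. The block definitions and objective below are placeholders.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Stand-in for a pretrained transformer block (attention + MLP)."""
    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
    def forward(self, x):
        x = x + self.attn(x, x, x, need_weights=False)[0]
        return x + self.mlp(x)

dim, depth = 64, 6
pretrained = nn.Sequential(*[Block(dim) for _ in range(depth)])  # stand-in for a pretrained model

class DepthwiseConvBlock(nn.Module):
    """A cheaper replacement operator grafted in place of one attention block."""
    def __init__(self, dim: int):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
    def forward(self, x):                      # x: (batch, tokens, dim)
        return x + self.conv(x.transpose(1, 2)).transpose(1, 2)

pretrained[3] = DepthwiseConvBlock(dim)        # edit the architecture in place
for p in pretrained.parameters():
    p.requires_grad = False                    # freeze the pretrained weights
for p in pretrained[3].parameters():
    p.requires_grad = True                     # train only the new block

trainable = [p for p in pretrained.parameters() if p.requires_grad]
opt = torch.optim.AdamW(trainable, lr=1e-4)
x = torch.randn(2, 16, dim)
loss = pretrained(x).pow(2).mean()             # placeholder training objective
loss.backward()
opt.step()
```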

Yunzhi Zhang (@zhang_yunzhi)

(1/n) Time to unify your favorite visual generative models, VLMs, and simulators for controllable visual generation—Introducing a Product of Experts (PoE) framework for inference-time knowledge composition from heterogeneous models.
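
The product-of-experts idea itself is easy to state: the joint score of a candidate is the product of per-expert probabilities, which is the sum of their log-scores. Below is a toy, self-contained illustration with made-up experts (a realism prior, a VLM-style constraint checker, and a simulator-style physics check); it is not the paper's actual framework.

```python
import random

def expert_realism(candidate: dict) -> float:
    """Stand-in for a generative prior: prefers moderate object counts."""
    return -abs(candidate["num_objects"] - 3.0)

def expert_vlm(candidate: dict) -> float:
    """Stand-in for a VLM checking a text constraint ('exactly two chairs')."""
    return 0.0 if candidate["num_chairs"] == 2 else -5.0

def expert_simulator(candidate: dict) -> float:
    """Stand-in for a physics check: heavily penalize interpenetration."""
    return -10.0 if candidate["has_collisions"] else 0.0

experts = [expert_realism, expert_vlm, expert_simulator]

def joint_log_score(candidate: dict) -> float:
    # log of a product of expert probabilities = sum of expert log-scores.
    return sum(e(candidate) for e in experts)

# Sample candidate scene descriptions; keep the one all experts jointly prefer.
random.seed(0)
candidates = [
    {"num_objects": random.randint(1, 6),
     "num_chairs": random.randint(0, 3),
     "has_collisions": random.random() < 0.3}
    for _ in range(32)
]
best = max(candidates, key=joint_log_score)
print(best, joint_log_score(best))
```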

Hong-Xing "Koven" Yu (@koven_yu)

#ICCV2025 🤩3D world generation is cool, but it is cooler to play with the worlds using 3D actions 👆💨, and see what happens! — Introducing *WonderPlay*: Now you can create dynamic 3D scenes that respond to your 3D actions from a single image! Web: kyleleey.github.io/WonderPlay/ 🧵1/7