Jay Karhade (@jaykarhade) 's Twitter Profile
Jay Karhade

@jaykarhade

PhD Robotics @CMU_Robotics, Computer Vision, Robotics.

ID: 1567636996993998852

Link: https://jaykarhade.github.io/ · Joined: 07-09-2022 22:12:58

127 Tweets

355 Followers

388 Following

Andrew Davison (@ajddavison) 's Twitter Profile Photo

All researchers should fight against this. Every week I try to persuade my students that top papers often have few quantitative results. With work that's new, important, and clearly qualitatively different (zero to one!), you don't need quantitative results. Demos not tables!

Kyle Sargent (@kylesargentai) 's Twitter Profile Photo

Modern generative models of images and videos rely on tokenizers. Can we build a state-of-the-art discrete image tokenizer with a diffusion autoencoder? Yes! I’m excited to share FlowMo, with Kyle Hsu, Justin Johnson, Fei-Fei Li, Jiajun Wu. A thread 🧵:

Hong-Xing "Koven" Yu (@koven_yu) 's Twitter Profile Photo

🔥Spatial intelligence requires world generation, and now we have the first comprehensive evaluation benchmark📏 for it! Introducing WorldScore: Unifying evaluation for 3D, 4D, and video models on world generation! 🧵1/7 Web: haoyi-duan.github.io/WorldScore/ arxiv: arxiv.org/abs/2504.00983

Khiem Vuong (@kvuongdev) 's Twitter Profile Photo

[1/6] Recent models like DUSt3R generalize well across viewpoints, but performance drops on aerial-ground pairs. At #CVPR2025, we propose AerialMegaDepth (aerial-megadepth.github.io), a hybrid dataset combining mesh renderings with real ground images (MegaDepth) to bridge this gap.

Zhiqiu Lin (@zhiqiulin) 's Twitter Profile Photo

Fresh GPT‑o3 results on our vision‑centric #NaturalBench (NeurIPS’24) benchmark! 🎯 Its new visual chain‑of‑thought—by “zooming in” on details—cracks questions that still stump GPT‑4o. Yet vision reasoning isn’t solved: o3 can still hallucinate even after a full minute of

Ishan Khatri ✈️ ICLR'25 (@i_ikhatri) 's Twitter Profile Photo

Just over a month left to submit to this year's Argoverse 2 challenges! Returning from previous years are our motion forecasting and lidar scene flow challenges. And NEW this year, with a $10k prize pool, is our Scenario Mining challenge! 🧵👇

Chris Rockwell (@_crockwell) 's Twitter Profile Photo

Ever wish YouTube had 3D labels? 🚀Introducing🎥DynPose-100K🎥, an Internet-scale collection of diverse videos annotated with camera pose! Applications include camera-controlled video generation🤩and learned dynamic pose estimation😯 Download: huggingface.co/datasets/nvidi…
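
For anyone who wants to poke at the release programmatically, here is a minimal loading sketch using the Hugging Face datasets library. The repo id below is a placeholder guess, since the tweet's link is truncated ("…nvidi…"); substitute the real dataset path.

from datasets import load_dataset

# Hypothetical repo id -- the tweet's link is cut off
# ("huggingface.co/datasets/nvidi…"); replace with the actual path.
ds = load_dataset("nvidia/DynPose-100K", split="train", streaming=True)

# Inspect the first record's fields (e.g., video ids, camera poses).
first = next(iter(ds))
print(first.keys())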

Jay Karhade (@jaykarhade) 's Twitter Profile Photo

Super cool project to have been involved in! Camera motion understanding is far from solved — even top SLAM/SfM and VLM models struggle in the wild. CameraBench pushes the frontier with high-quality annotations and a cinematographer-designed taxonomy. VLMs 🤝 SfM next? 😉

Chuang Gan (@gan_chuang) 's Twitter Profile Photo

What a fun collaboration with Zhiqiu on this summer internship project! Understanding camera motion in videos is extremely challenging, and this CameraBench will be critically important for both video captioning and video generation!

Hanwen Jiang (@hanwenjiang1) 's Twitter Profile Photo

Supervised learning has held 3D vision back for too long. Meet RayZer — a self-supervised 3D model trained with zero 3D labels: ❌ No supervision of camera & geometry ✅ Just RGB images. And the wild part? RayZer outperforms supervised methods (as 3D labels from COLMAP are noisy).

Justin Johnson (@jcjohnss) 's Twitter Profile Photo

Compute increases in the last ~decade are insane. The B200 is 1000x faster than the K40 that was state-of-the-art when I started my PhD. We used to train on 1 GPU; now 10K+ is common. Combining these gives a speedup of 10 million since 2013. This explosion led to modern AI.

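A quick back-of-envelope check of that claim, as a sketch in plain Python that just multiplies the tweet's own figures:

# Tweet's stated numbers: ~1000x per-GPU gain (K40 -> B200) and
# ~10,000x from scaling out (1 GPU in 2013 -> 10K+ GPUs now).
per_gpu = 1_000
scale_out = 10_000
print(f"{per_gpu * scale_out:,}x")  # 10,000,000x, i.e. ~10 million
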
Akash Sharma (@akashshrm02) 's Twitter Profile Photo

Last week I passed my thesis proposal, and I'm now officially a Ph.D. candidate! My proposed thesis, "Self-supervised perception for tactile dexterity," will explore ways to improve dexterous manipulation using tactile representations. Thanks to my committee and everyone who supported me!

Akash Sharma (@akashshrm02) 's Twitter Profile Photo

Robots need touch for human-like hands to reach the goal of general manipulation. However, approaches today either don't use tactile sensing or use a separate architecture per tactile task. Can one model improve many tactile tasks? 🌟Introducing Sparsh-skin: tinyurl.com/y935wz5c 1/6

Mihir Prabhudesai (@mihirp98) 's Twitter Profile Photo

Excited to share our work: Maximizing Confidence Alone Improves Reasoning. Humans rely on confidence to learn when answer keys aren’t available (e.g., taking an exam). Surprisingly, LLMs can also learn w/o ground-truth answers, simply by reinforcing high-confidence answers via RL!
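
A minimal sketch of the core idea as the tweet describes it: use the model's own confidence as the RL reward. Negative token entropy is used here as an illustrative confidence measure; the function name, tensor shapes, and exact objective are assumptions, not the paper's code.

import torch
import torch.nn.functional as F

def confidence_reward(logits: torch.Tensor) -> torch.Tensor:
    # logits: (seq_len, vocab_size) for one sampled answer.
    # Reward = mean negative entropy of the token distributions, so
    # more confident generations score higher -- an illustrative
    # stand-in for "reinforcing high-confidence answers".
    log_p = F.log_softmax(logits, dim=-1)
    entropy = -(log_p.exp() * log_p).sum(dim=-1)  # per-token entropy
    return -entropy.mean()

# Usage: score sampled answers, then feed the rewards to any RL
# method (e.g., policy gradient) -- no ground-truth answers needed.
reward = confidence_reward(torch.randn(12, 32000))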

Fei-Fei Li (@drfeifei) 's Twitter Profile Photo

Check out this shiny new, fast and dynamic web renderer for 3D Gaussian Splats! The things one could do are just mind boggling! So proud of the World Labs team that made this happen, and we are making this open source for everyone!

Jay Karhade (@jaykarhade) 's Twitter Profile Photo

UFM is a step toward solving the top 3 problems of computer vision: Correspondence, Correspondence and Correspondence 🙃 Exciting collab led by Yuchen Zhang! 1 year in the making, and lots of engineering and insights uncovered!

Zhenjun Zhao (@zhenjun_zhao) 's Twitter Profile Photo

UFM: A Simple Path towards Unified Dense Correspondence with Flow

Yuchen Zhang, Nikhil Keetha, Chenwei Lyu, Bhuvan Jhamb, Yutian Chen, Yuheng Qiu, Jay Karhade, Shreyas Jha, Yaoyu HU, Deva Ramanan, Sebastian Scherer, Wenshan Wang

tl;dr: transformer-based architecture using covisibility

Haoyu Xiong (@haoyu_xiong_) 's Twitter Profile Photo

Your bimanual manipulators might need a Robot Neck 🤖🦒 Introducing Vision in Action: Learning Active Perception from Human Demonstrations ViA learns task-specific, active perceptual strategies—such as searching, tracking, and focusing—directly from human demos, enabling robust