Shivam Chandhok (@shivamchandhok2) 's Twitter Profile
Shivam Chandhok

@shivamchandhok2

Computer Vision. Robotics. Iron Man fan. Coffee aficionado.
MSc (PhD Track) @UBC, Vancouver. Research Engineer @INRIA, France. Researcher @IIT Hyderabad.

ID: 2228129630

Joined: 03-12-2013 11:13:39

3.3K Tweets

175 Followers

483 Following

Yifu Qiu (@yifuqiu98) 's Twitter Profile Photo

🔁 What if you could bootstrap a world model (state1 × action → state2) using a much easier-to-train dynamics model (state1 × state2 → action) in a generalist VLM? 💡 We show how a dynamics model can generate synthetic trajectories & serve for inference-time verification 🧵👇
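The thread's core idea can be sketched in a toy form: in a world with known, simple dynamics, an inverse dynamics model (state1, state2) → action is easy to obtain, and it can both label raw state-pair data into synthetic trajectories and verify a forward world model's predictions at inference time. This is an illustrative sketch, not the paper's actual models or training setup.

```python
# Toy 1-D world: action a ∈ {-1, +1} moves the state by a.
# The "inverse dynamics model" (s1, s2) -> a is trivial here; in the paper's
# setting it is a learned model that is much easier to train than the
# forward world model (s1, a) -> s2.

def inverse_dynamics(s1, s2):
    """Recover the action that explains the transition s1 -> s2."""
    return s2 - s1

def forward_model(s1, a):
    """Candidate world model (here: the true dynamics, for illustration)."""
    return s1 + a

def verify(s1, a, s2_pred):
    """Inference-time verification: does the inverse model agree that the
    predicted next state is reached from s1 via action a?"""
    return inverse_dynamics(s1, s2_pred) == a

# Synthesize a trajectory from raw state pairs by inferring the actions.
pairs = [(0, 1), (1, 2), (2, 1)]
trajectory = [(s1, inverse_dynamics(s1, s2), s2) for s1, s2 in pairs]
assert all(verify(s1, a, forward_model(s1, a)) for s1, a, _ in trajectory)
```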

Yuki (@y_m_asano) 's Twitter Profile Photo

Today we release Franca, a new vision Foundation Model that matches and sometimes outperforms DINOv2. The data, the training code and the model weights (with intermediate checkpoints) are open-source, allowing everyone to build on this. Methodologically, we introduce two new

ℏεsam (@hesamation) 's Twitter Profile Photo

the legendary <a href="/danielhanchen/">Daniel Han</a> just made a full 3-hour workshop on reinforcement learning and agents. 

he goes through RL fundamentals, kernels, quantization, and RL+Agents covering both theory and code. 

great video to get up to speed on these topics.
Benno Krojer (@benno_krojer) 's Twitter Profile Photo

Love to see this. I am always hoping for papers showing that text-only understanding is influenced by being physically grounded (images, videos, interaction). It was a big hope of people years ago with few positive findings; glad it is still being explored!

Tanishq Mathew Abraham, Ph.D. (@iscienceluvr) 's Twitter Profile Photo

Latent Denoising Makes Good Visual Tokenizers

"we introduce the Latent Denoising Tokenizer (l-DeTok), a simple yet effective tokenizer trained to reconstruct clean images from latent embeddings corrupted by interpolative noise and random masking. Extensive experiments on
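The corruption described in the abstract can be sketched as follows. The blend coefficient, mask ratio, and zeroing of masked tokens are illustrative assumptions, not the paper's exact recipe; the decoder would then be trained to reconstruct the clean image from these corrupted latents.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt_latents(z, noise_level=0.7, mask_ratio=0.3):
    """Corrupt latent embeddings z (tokens x dim) with (a) interpolative
    noise -- a convex blend of the latent and Gaussian noise -- and
    (b) random token masking (masked tokens zeroed here, an assumption)."""
    noise = rng.standard_normal(z.shape)
    z_noisy = (1 - noise_level) * z + noise_level * noise  # interpolate toward noise
    mask = rng.random(z.shape[0]) < mask_ratio             # drop a subset of tokens
    z_noisy[mask] = 0.0
    return z_noisy, mask

z = rng.standard_normal((16, 8))      # 16 latent tokens of dimension 8
z_corrupted, mask = corrupt_latents(z)
```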
Simone Scardapane (@s_scardapane) 's Twitter Profile Photo

*Emergence and Evolution of Interpretable Concepts in Diffusion Models*
by <a href="/berk_tinaz/">Berk Tınaz</a> <a href="/zalan_fabian/">Zalan Fabian</a> <a href="/mahdisoltanol/">Mahdi Soltanolkotabi</a>
 
SAEs trained on cross-attention layers of StableDiffusion are (surprisingly) good and can be used to intervene on the generation.
 
arxiv.org/abs/2504.15473
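For readers new to the technique: a sparse autoencoder (SAE) of the kind trained on model activations is an overcomplete dictionary with ReLU codes and an L1 sparsity penalty. This is a minimal generic sketch (numpy, no training loop), not the paper's architecture or hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

class SparseAutoencoder:
    """Minimal SAE sketch: overcomplete (d_hidden > d_in) linear encoder
    with ReLU, linear decoder, and an L1 penalty on the codes."""
    def __init__(self, d_in, d_hidden):
        self.W_enc = rng.standard_normal((d_in, d_hidden)) / np.sqrt(d_in)
        self.b_enc = np.zeros(d_hidden)
        self.W_dec = self.W_enc.T.copy()
        self.b_dec = np.zeros(d_in)

    def encode(self, x):
        return np.maximum(x @ self.W_enc + self.b_enc, 0.0)  # sparse ReLU codes

    def decode(self, h):
        return h @ self.W_dec + self.b_dec

    def loss(self, x, l1=1e-3):
        h = self.encode(x)
        x_hat = self.decode(h)
        return np.mean((x - x_hat) ** 2) + l1 * np.abs(h).mean()

sae = SparseAutoencoder(d_in=64, d_hidden=512)
acts = rng.standard_normal((32, 64))  # stand-in for cross-attention activations
codes = sae.encode(acts)
```

Once trained, individual code dimensions tend to align with interpretable concepts, which is what makes intervening on generation possible.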
Yacine Mahdid (@yacinelearning) 's Twitter Profile Photo

15 min tutorial on the Adam optimizer 
by the end of it you will understand what is up with the formula 100%

you'll see it's not that complicated™️
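The formula the tutorial walks through is indeed compact: exponential moving averages of the gradient and its square, bias correction, then a scaled step. A minimal numpy version of the standard Adam update (default betas/eps from the original paper):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update step (t is the 1-indexed step count)."""
    m = beta1 * m + (1 - beta1) * grad        # EMA of gradients (1st moment)
    v = beta2 * v + (1 - beta2) * grad ** 2   # EMA of squared gradients (2nd moment)
    m_hat = m / (1 - beta1 ** t)              # bias correction for zero init
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# demo: minimize f(x) = x^2 starting from x = 5
x = np.array([5.0])
m, v = np.zeros(1), np.zeros(1)
for t in range(1, 2001):
    g = 2.0 * x                               # gradient of x^2
    x, m, v = adam_step(x, g, m, v, t, lr=0.1)
```

Note that near the minimum the effective step is roughly `lr` in magnitude (since `m_hat / sqrt(v_hat)` is sign-like), which is why Adam hovers around the optimum rather than converging exactly without a learning-rate decay.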
Fu-En (Fred) Yang (@fuenyang1) 's Twitter Profile Photo

🤖 How can we teach embodied agents to think before they act?

🚀 Introducing ThinkAct — a hierarchical Reasoning VLA framework with an MLLM for complex, slow reasoning and an action expert for fast, grounded execution.
Slow think, fast act. 🧠⚡🤲
Simone Scardapane (@s_scardapane) 's Twitter Profile Photo

*I-Con: A Unifying Framework for Representation Learning*
by <a href="/Sa_9810/">Shaden</a> <a href="/mhamilton723/">Mark Hamilton</a> et al.

They show that many losses (contrastive, supervised, clustering, ...) can be derived from a single loss defined in terms of neighbor distributions.

arxiv.org/abs/2504.16929
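The unifying loss is a KL divergence between two neighbor distributions per anchor point: a supervisory one p and a learned one q from embedding similarities. A toy numpy sketch (the one-hot p recovers a contrastive-style special case; note that unlike real contrastive losses this simplification does not exclude self-similarity from q):

```python
import numpy as np

def softmax(s, axis=-1):
    s = s - s.max(axis=axis, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=axis, keepdims=True)

def neighbor_kl_loss(p, z, tau=0.1):
    """I-Con-style loss (sketch): mean over anchors i of KL(p[i] || q[i]),
    where q[i] is the learned neighbor distribution induced by embedding
    similarities and p[i] is the supervisory neighbor distribution."""
    q = softmax(z @ z.T / tau, axis=1)
    eps = 1e-12
    return np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1))

# contrastive special case: p puts all mass on each anchor's positive pair
n = 4
p = np.roll(np.eye(n), 1, axis=1)   # toy choice: point i's neighbor is i+1
z = np.random.default_rng(0).standard_normal((n, 8))
loss = neighbor_kl_loss(p, z)
```

Different choices of p (one-hot positives, class labels, cluster assignments, Gaussian neighborhoods) recover the different losses the paper unifies.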
Micah Goldblum (@micahgoldblum) 's Twitter Profile Photo

🚨Announcing Zebra-CoT, a large-scale dataset of high-quality interleaved image-text reasoning traces 📜. Humans often draw visual aids like diagrams when solving problems, but existing VLMs reason mostly in pure text. 1/n
Mehul Damani @ ICLR (@mehuldamani2) 's Twitter Profile Photo

🚨New Paper!🚨
We trained reasoning LLMs to reason about what they don't know.

o1-style reasoning training improves accuracy but produces overconfident models that hallucinate more.

Meet RLCR: a simple RL method that trains LLMs to reason and reflect on their uncertainty --
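The shape of such a reward can be sketched as correctness plus a calibration term; a Brier-score penalty on the model's stated confidence is one natural choice. This is an illustrative sketch of the idea, not necessarily the paper's exact formula or coefficients.

```python
def rlcr_style_reward(correct: bool, confidence: float) -> float:
    """Reward correctness, and penalize miscalibration with a Brier-score
    term so stated confidence must track actual accuracy (sketch)."""
    y = 1.0 if correct else 0.0
    brier = (confidence - y) ** 2
    return y - brier

# a confident correct answer beats an overconfident wrong one
assert rlcr_style_reward(True, 0.9) > rlcr_style_reward(False, 0.9)
# when wrong, admitting low confidence is penalized less than bluffing
assert rlcr_style_reward(False, 0.1) > rlcr_style_reward(False, 0.9)
```

Under this reward, hallucinating confidently on a wrong answer is strictly worse than expressing uncertainty, which is the behavior the training method targets.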
Salesforce AI Research (@sfresearch) 's Twitter Profile Photo

💡 Promptomatix: An Automatic Prompt Optimization Framework for Large Language Models 💡

📄 Paper: bit.ly/44IAvuO 
💻 Code: bit.ly/4lLjQgd 

😵‍💫 Have a task but experiencing prompt engineering existential dread?

Few-shot or zero-shot? Chain-of-thought or ReAct?
Ruihan Yang (@rchalyang) 's Twitter Profile Photo

How can we leverage diverse human videos to improve robot manipulation? Excited to introduce EgoVLA — a Vision-Language-Action model trained on egocentric human videos by explicitly modeling wrist & hand motion. We build a shared action space between humans and robots, enabling

Denny Zhou (@denny_zhou) 's Twitter Profile Photo

Slides for my lecture “LLM Reasoning” at Stanford CS 25: dennyzhou.github.io/LLM-Reasoning-…

Key points:
1. Reasoning in LLMs simply means generating a sequence of intermediate tokens before producing the final answer. Whether this resembles human reasoning is irrelevant. The crucial

Duy Nguyen (@duynguyen772) 's Twitter Profile Photo

🚀 We introduce GrAInS, a gradient-based attribution method for inference-time steering (of both LLMs & VLMs).

✅ Works for both LLMs (+13.2% on TruthfulQA) & VLMs (+8.1% win rate on SPA-VL).
✅ Preserves core abilities (<1% drop on MMLU/MMMU).

LLMs & VLMs often fail because
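For context on the core primitive: gradient-based attribution scores each input coordinate by how much it drives a scalar output, the simplest variant being input × gradient. A toy sketch on a linear scorer (where the gradient is exact and attributions sum to the score); GrAInS builds on attributions like this to decide which directions to steer, and the linear model here is only a stand-in.

```python
import numpy as np

def input_x_gradient(x, w):
    """Input-x-gradient attribution for a linear score f(x) = w . x:
    grad_x f = w, so coordinate i gets attribution x_i * w_i.
    For this linear case the attributions sum exactly to f(x)."""
    grad = w              # d(w.x)/dx for a linear model
    return x * grad

x = np.array([1.0, -2.0, 0.5])
w = np.array([0.2, 0.4, -1.0])
attr = input_x_gradient(x, w)
top = int(np.argmax(np.abs(attr)))   # most influential coordinate
```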