Sanjay Subramanian (@sanjayssub) 's Twitter Profile
Sanjay Subramanian

@sanjayssub

Building/analyzing NLP and vision models. PhD student @berkeley_ai. Formerly: @allen_ai, @penn

ID: 1176913670057545729

Website: https://people.eecs.berkeley.edu/~sanjayss/ · Joined: 25-09-2019 17:37:49

254 Tweets

889 Followers

560 Following

Lea Müller (@leamue27) 's Twitter Profile Photo

- Humans and Structure from Motion - We jointly reconstruct 3D humans, scene point cloud, and cameras from images captured with sparse uncalibrated cameras. ✨Enjoy reading & happy holidays✨ Project page: muelea.github.io/hsfm

Jiaxin Ge (@aomaru_21490) 's Twitter Profile Photo

Introducing "AutoPresent: Designing Structured Visuals From Scratch". We employ code generation to create structured, high-quality presentation slides from scratch! 📄 arxiv.org/abs/2501.00912 🤗 huggingface.co/spaces/JiaxinG… 🔗 github.com/para-lost/Auto… Berkeley AI Research Language Technologies Institute | @CarnegieMellon

Eve Fleisig (@enfleisig) 's Twitter Profile Photo

How does model calibration stand up against humans? We ran live competitions, comparing model and human calibration, to create GRACE: a new fine-grained calibration benchmark grounded in human performance. What we found was unexpected! 🧵 📄arxiv.org/pdf/2502.19684
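
The tweet doesn't spell out GRACE's scoring, so as background here is the standard expected calibration error (ECE), the usual way to quantify how well stated confidence tracks actual accuracy. This is a reference point, not the benchmark's own metric.

```python
# Standard expected calibration error: bin predictions by stated confidence,
# then compare average confidence to empirical accuracy within each bin.
import numpy as np

def expected_calibration_error(confidences: np.ndarray, correct: np.ndarray, n_bins: int = 10) -> float:
    """confidences in [0, 1]; correct is 0/1 per prediction."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece, n = 0.0, len(confidences)
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        gap = abs(confidences[mask].mean() - correct[mask].mean())
        ece += (mask.sum() / n) * gap
    return float(ece)

# Example: an overconfident model on five answers.
conf = np.array([0.9, 0.8, 0.95, 0.6, 0.7])
hit = np.array([1, 0, 1, 1, 0])
print(expected_calibration_error(conf, hit))
```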

Zineng Tang (@zinengtang) 's Twitter Profile Photo

We are thrilled to announce TULIP! 🌷 tulip-berkeley.github.io A state-of-the-art vision-language encoder coupled with a generative model for stronger representation learning.

Baifeng (@baifeng_shi) 's Twitter Profile Photo

Next-gen vision pre-trained models shouldn’t be short-sighted. Humans can easily perceive 10K x 10K resolution. But today’s top vision models—like SigLIP and DINOv2—are still pre-trained at merely hundreds by hundreds of pixels, bottlenecking their real-world usage. Today, we

Jiayi Pan (@jiayi_pirate) 's Twitter Profile Photo

We explore a new dimension in scaling reasoning models with Adaptive Parallel Reasoning (APR): LMs learn to orchestrate both serial & parallel compute end-to-end via supervised training + RL, with better efficiency and scalability than long CoT on Countdown. 🧵 arxiv.org/abs/2504.15466

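As a rough illustration of the serial-vs-parallel orchestration being described (not the APR training recipe itself), here is a hypothetical controller in which the model either answers directly or emits sub-queries that are rolled out concurrently. `call_llm` and the JSON protocol are placeholder assumptions.

```python
# Hypothetical orchestration sketch: the LM may answer directly or spawn
# parallel sub-rollouts whose results are folded back into the context.
import json
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def orchestrated_rollout(call_llm: Callable[[str], str], problem: str, depth: int = 2) -> str:
    """`call_llm` is a placeholder LM call returning JSON:
    either {"answer": ...} or {"spawn": [subquery, ...]}."""
    reply = json.loads(call_llm(problem))
    if depth == 0 or "answer" in reply:
        return reply.get("answer", "")
    # Parallel branch: run the spawned sub-queries concurrently.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(
            lambda q: orchestrated_rollout(call_llm, q, depth - 1),
            reply["spawn"],
        ))
    # Serial continuation: merge parallel results and ask for a final answer.
    merged = problem + "\nSub-results:\n" + "\n".join(results)
    return json.loads(call_llm(merged)).get("answer", "")
```
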
Nicholas Tomlin (@nickatomlin) 's Twitter Profile Photo

The long-term goal of AI is to build models that can handle arbitrary tasks, not just ones they’ve been trained on. We hope our new *benchmark generator* can help measure progress toward this vision

Ruiqi Zhong (@zhongruiqi) 's Twitter Profile Photo

Last day of PhD! I pioneered using LLMs to explain datasets & models. It's used by the interpretability team at OpenAI and the societal impacts team at Anthropic. Tutorial here. It's a great direction & someone should carry the torch :) Thesis available, if you wanna read my acknowledgement section =P

Ritwik Gupta 🇺🇦 (@ritwik_g) 's Twitter Profile Photo

Ever wondered if the way we feed image patches to vision models is the best way? The standard row-by-row scan isn't always optimal! Modern long-sequence transformers can be surprisingly sensitive to patch order. We developed REOrder to find better, task-specific patch sequences.
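
For context on the knob being tuned here, a minimal sketch of patch serialization: extract patches in the default raster (row-by-row) order, then apply an alternative permutation before the tokens enter a sequence model. This only illustrates that the ordering is a free choice; it is not the REOrder search procedure.

```python
# Patchify an image in raster order, then reorder the patch sequence with an
# arbitrary permutation (here column-major) before feeding a sequence model.
import torch

def patchify(img: torch.Tensor, p: int) -> torch.Tensor:
    """img: (C, H, W) with H, W divisible by p -> (num_patches, C*p*p) in raster order."""
    C, H, W = img.shape
    patches = img.unfold(1, p, p).unfold(2, p, p)   # (C, H//p, W//p, p, p)
    patches = patches.permute(1, 2, 0, 3, 4)        # (H//p, W//p, C, p, p)
    return patches.reshape(-1, C * p * p)

def reorder(patches: torch.Tensor, perm: torch.Tensor) -> torch.Tensor:
    """Apply a task-specific patch permutation (column-major, a space-filling
    curve, a learned order, ...) instead of the raster default."""
    return patches[perm]

img = torch.randn(3, 224, 224)
tokens = patchify(img, 16)                          # (196, 768), row-by-row
grid = 224 // 16
col_major = torch.arange(grid * grid).reshape(grid, grid).t().reshape(-1)
tokens_cm = reorder(tokens, col_major)              # column-by-column scan
```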

Yutong Bai (@yutongbai1002) 's Twitter Profile Photo

What would a World Model look like if we start from a real embodied agent acting in the real world? It has to have: 1) A real, physically grounded and complex action space—not just abstract control signals. 2) Diverse, real-life scenarios and activities. Or in short: It has to

Jessy Lin (@realjessylin) 's Twitter Profile Photo

User simulators bridge RL with real-world interaction // jessylin.com/2025/07/10/use… How do we get the RL paradigm to work on tasks beyond math & code? Instead of designing datasets, RL requires designing environments. Given that most non-trivial real-world tasks involve

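A dependency-free sketch of the general pattern the post describes: wrap a simulated user (a placeholder LLM callable here) in an RL-style environment so an agent policy can be trained on multi-turn interaction. The interface and reward placement are assumptions for illustration, not the post's code.

```python
# Hypothetical RL-style environment around a simulated user.
from typing import Callable, List, Tuple

class SimulatedUserEnv:
    def __init__(self,
                 user_llm: Callable[[List[str]], str],   # next user message given dialogue so far
                 judge: Callable[[List[str]], float],    # scores the finished dialogue
                 max_turns: int = 8):
        self.user_llm = user_llm
        self.judge = judge
        self.max_turns = max_turns

    def reset(self) -> str:
        self.dialogue: List[str] = []
        first_msg = self.user_llm(self.dialogue)
        self.dialogue.append("user: " + first_msg)
        return first_msg

    def step(self, agent_reply: str) -> Tuple[str, float, bool]:
        """Agent acts; the simulated user responds; reward arrives at episode end."""
        self.dialogue.append("agent: " + agent_reply)
        if len(self.dialogue) >= 2 * self.max_turns:
            return "", self.judge(self.dialogue), True
        user_msg = self.user_llm(self.dialogue)
        self.dialogue.append("user: " + user_msg)
        return user_msg, 0.0, False
```
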
Baifeng (@baifeng_shi) 's Twitter Profile Photo

Understanding a video involves both short-range and long-range understanding. Short-range understanding is more about "motion" and requires system-1 perception. Long-range understanding is more system-2, and requires memory, reasoning, etc. Both have huge room for improvement.

Ruilong Li (@ruilong_li) 's Twitter Profile Photo

For everyone interested in precise 📷camera control📷 in transformers [e.g., video / world models]: stop settling for Plücker raymaps -- use camera-aware relative PE in your attention layers, like RoPE (for LLMs) but for cameras! Paper & code: liruilong.cn/prope/

Lakshya A Agrawal (@lakshyaaagrawal) 's Twitter Profile Photo

How does prompt optimization compare to RL algos like GRPO? GRPO needs 1000s of rollouts, but humans can learn from a few trials—by reflecting on what worked & what didn't. Meet GEPA: a reflective prompt optimizer that can outperform GRPO by up to 20% with 35x fewer rollouts!🧵

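To make "reflective prompt optimization" concrete, here is a hypothetical minimal loop: run a few rollouts, ask the LLM to reflect on what worked and what didn't, and let it rewrite the prompt. `call_llm` and `score` are placeholders you would supply; this is not the GEPA algorithm, just the shape of the idea.

```python
# Hypothetical reflective prompt-search loop (not the GEPA implementation).
from typing import Callable, List, Tuple

def reflective_prompt_search(
    call_llm: Callable[[str], str],        # prompt -> model output
    score: Callable[[str, dict], float],   # (output, task example) -> reward in [0, 1]
    tasks: List[dict],                     # small batch of task examples with an "input" field
    seed_prompt: str,
    rounds: int = 5,
) -> Tuple[str, float]:
    best_prompt, best_avg = seed_prompt, -1.0
    prompt = seed_prompt
    for _ in range(rounds):
        # 1) A handful of rollouts (far fewer than RL-style training would need).
        transcripts = []
        for task in tasks:
            output = call_llm(prompt + "\n\n" + task["input"])
            transcripts.append((task["input"], output, score(output, task)))
        avg = sum(r for _, _, r in transcripts) / len(transcripts)
        if avg > best_avg:
            best_prompt, best_avg = prompt, avg
        # 2) Reflection: ask the LLM what worked and what didn't, then rewrite the prompt.
        reflection_request = (
            "You are improving an instruction prompt.\n"
            f"Current prompt:\n{prompt}\n\n"
            "Rollouts as (input, output, reward):\n"
            + "\n".join(f"- {i!r} -> {o!r} (reward={r:.2f})" for i, o, r in transcripts)
            + "\n\nExplain what went wrong, then output an improved prompt only."
        )
        prompt = call_llm(reflection_request)
    return best_prompt, best_avg
```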