Simran Arora (@simran_s_arora)'s Twitter Profile

Simran Arora

@simran_s_arora

cs @StanfordAILab @hazyresearch

ID: 4712264894

https://arorasimran.com/ · Joined 05-01-2016 06:18:44

311 Tweets

3.3K Followers

193 Following

Neel Guha (@neelguha):

Really cool work led by Sabri Eyuboglu, Ryan Ehrlich, and Simran Arora! The idea of training a "cartridge" that represents the knowledge in a document (or corpus) and can be slotted into LLMs to support engagement has tons of applications/practical importance for law (1/4)

Ryan Ehrlich (@ryansehrlich):

Giving LLMs very large amounts of context can be really useful, but it can also be slow and expensive. Could scaling inference-time compute help? In our latest work, we show that allowing models to spend test-time compute to “self-study” a large corpus can >20x decode…

Simran Arora (@simran_s_arora):

There’s been tons of work on KV-cache compression and on KV-cache-free Transformer alternatives (SSMs, linear attention) for long context, but we know there’s no free lunch with these methods. The quality-memory tradeoffs are annoying. *Is all lost?* Introducing CARTRIDGES:

Simran Arora (@simran_s_arora):

Check out CARTRIDGES, scaling cache-time compute! An alternative to ICL for settings where many different user messages reference the same large corpus of text!

gm8xx8 (@gm8xx8):

Cartridges: Storing long contexts in tiny caches with self-study

- train-once, reusable memory via SELF-STUDY
- 38.6× less memory, 26.4× higher throughput
- extends context to 484k, composes across corpora
- outperforms LoRA, DuoAttention, and standard ICL

BLOG:
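To make these claims concrete: here is a minimal sketch of what a "cartridge" could look like as a trainable KV prefix. The class name, shapes, and init scale are illustrative assumptions, not the authors' actual implementation; it only assumes a decoder whose cache is a per-layer tuple of key/value tensors.

```python
import torch

class Cartridge(torch.nn.Module):
    """Hypothetical trainable KV prefix: a small, fixed-size stand-in
    for the KV cache that a long corpus prompt would normally produce."""

    def __init__(self, n_layers: int, n_kv_heads: int, n_tokens: int, head_dim: int):
        super().__init__()
        # One trainable (key, value) pair per layer, shaped like a cache
        # for `n_tokens` virtual tokens: (batch, heads, tokens, head_dim).
        shape = (1, n_kv_heads, n_tokens, head_dim)
        self.keys = torch.nn.ParameterList(
            [torch.nn.Parameter(0.02 * torch.randn(shape)) for _ in range(n_layers)]
        )
        self.values = torch.nn.ParameterList(
            [torch.nn.Parameter(0.02 * torch.randn(shape)) for _ in range(n_layers)]
        )

    def as_past_key_values(self):
        # Legacy Hugging Face cache layout: a (key, value) pair per layer.
        return tuple((k, v) for k, v in zip(self.keys, self.values))
```

At serving time, the cartridge would be slotted in wherever the model accepts a precomputed cache (e.g. a `past_key_values`-style argument), so every request over the same corpus reuses the same small prefix instead of re-encoding the full text; that is where the memory and throughput wins above come from.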
Azalia Mirhoseini (@azaliamirh):

Very excited to share this new approach to long-context LLMs!! (matching ICL quality, but with 39x less KV cache memory and 26x higher peak throughput)

The recipe: trade scaling offline inference-compute on the long context (via “self-study”) for compressed KV-cache memory (aka…
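As a rough sketch of that recipe, assuming the `Cartridge` sketch above and a frozen Hugging Face-style causal LM (the KL loss and helper names here are illustrative assumptions, not the paper's actual objective):

```python
import torch
import torch.nn.functional as F

def self_study_step(model, cartridge, question_ids, teacher_logits, optimizer):
    """One hypothetical 'self-study' update: nudge the model's next-token
    distribution *with only the cartridge* toward the distribution a
    teacher pass produced earlier *with the full corpus in context*."""
    out = model(input_ids=question_ids,
                past_key_values=cartridge.as_past_key_values())
    loss = F.kl_div(F.log_softmax(out.logits, dim=-1),
                    F.softmax(teacher_logits, dim=-1),
                    reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()   # gradients flow only into the cartridge's K/V parameters
    optimizer.step()
    return loss.item()
```

Repeating this offline over many self-generated questions about the corpus is the "scaling offline inference-compute" half of the trade; the model's weights stay frozen, so one trained cartridge can then serve every future request.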
Agent B (@michelivan92347):

Cartridges = an interesting offline alternative to regular ICL for frequently used large text corpora. 👇

A lot to learn in this awesome work imo.
(another one from a Hazy Research team)

Bravo to the team 👏
Teortaxes▶️ (DeepSeek Twitter 🐋 die-hard fan 2023 – ∞) (@teortaxestex):

I like this idea very much and have long advocated for something like this. Synthetically enriched «KV prefix» is a natural augment to modern long context models.

Karan Goel (@krandiash):

Today we shipped a new real-time API for streaming speech-to-text (a new family of models called Ink) that’s extremely fast, cheap, and designed specifically for voice agents. We’re cooking hard, lots more releases coming soon 🧑‍🍳

Kawin Ethayarajh (@ethayarajh):

Trading online compute for offline compute is an under-discussed axis of scaling, but one that will be increasingly relevant going forward.

Charles Foster (@cfgeek):

Looks like a very slick way to tune and cheaply serve custom models!

If I were building on this, I’d try to find a better way to initialize the cache. You can initialize LoRA as a no-op and let backprop handle the rest, but KV-tuning methods need weird initialization hacks.
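The LoRA contrast is concrete: because the low-rank update is B @ A, zero-initializing B makes the adapter an exact no-op at step 0, whereas even an all-zero KV cache still perturbs attention, so there is no equally clean "do nothing" start. A minimal sketch of the LoRA side (standard construction, simplified):

```python
import torch

class LoRALinear(torch.nn.Module):
    """LoRA wrapper: y = base(x) + (B @ A) x. With B = 0 at init, the
    adapter contributes nothing until backprop moves it off zero."""

    def __init__(self, base: torch.nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base  # frozen pretrained layer
        self.A = torch.nn.Parameter(0.01 * torch.randn(rank, base.in_features))
        self.B = torch.nn.Parameter(torch.zeros(base.out_features, rank))  # no-op init

    def forward(self, x):
        return self.base(x) + x @ self.A.T @ self.B.T
```

A KV-tuning method, by contrast, presumably has to start from something meaningful, e.g. the cache produced by real tokens from the corpus, which is the kind of initialization hack being alluded to.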
Jeremy Howard (@jeremyphoward):

Claude not able to continue my research chat about context compression papers because it ran out of context because it doesn't use context compression.

Cartesia (@cartesia_ai):

👑 We’re #1! Sonic-2 leads @Labelbox’s Speech Generation Leaderboard, topping out in speech quality, word error rate, and naturalness. Build your real-time voice apps with the 🥇 best voice AI model. ➡️ labelbox.com/leaderboards/s…

Infini-AI-Lab (@infiniailab):

🔥 We introduce Multiverse, a new generative modeling framework for adaptive and lossless parallel generation. 🚀 Multiverse is the first open-source non-AR model to achieve AIME24 and AIME25 scores of 54% and 46% 🌐 Website: multiverse4fm.github.io 🧵 1/n

Siddharth Karamcheti (@siddkaramcheti):

Thrilled to share that I'll be starting as an Assistant Professor at Georgia Tech (Georgia Tech School of Interactive Computing / Robotics@GT / Machine Learning at Georgia Tech) in Fall 2026. My lab will tackle problems in robot learning, multimodal ML, and interaction. I'm recruiting PhD students this next cycle – please apply/reach out!

Jon Saad-Falcon (@jonsaadfalcon):

How can we close the generation-verification gap when LLMs produce correct answers but fail to select them?

🧵 Introducing Weaver: a framework that combines multiple weak verifiers (reward models + LM judges) to achieve o3-mini-level accuracy with much cheaper non-reasoning…
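A toy sketch of the weak-verifier idea (the aggregation here is a plain weighted average for illustration; Weaver's actual combination of verifiers is more sophisticated):

```python
import numpy as np

def select_answer(candidates, verifiers, weights):
    """Score every candidate with every verifier, combine the scores,
    and keep the candidate with the best combined score."""
    scores = np.array([[verifier(c) for verifier in verifiers] for c in candidates])
    combined = scores @ np.asarray(weights, dtype=float)  # one score per candidate
    return candidates[int(np.argmax(combined))]

# Toy usage with stand-in verifiers that score a string answer in [0, 1]:
answers = ["42", "forty-two", "unsure"]
verifiers = [lambda a: float(a.isdigit()), lambda a: 1.0 / (1 + len(a))]
print(select_answer(answers, verifiers, weights=[0.7, 0.3]))  # -> "42"
```

The point of combining verifiers is that each one is individually weak (noisy reward models, imperfect LM judges), but their errors are partly independent, so an aggregate score selects correct answers more reliably than any single verifier.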
Sanjana Srivastava (@sanjana__z):

🤖 Household robots are becoming physically viable. But interacting with people in the home requires handling unseen, unconstrained, dynamic preferences, not just a complex physical domain. We introduce ROSETTA: a method to generate rewards for such preferences cheaply. 🧵⬇️

Jerry Liu (@jerrywliu):

1/10 ML can solve PDEs – but precision 🔬 is still a challenge. Towards high-precision methods for scientific problems, we introduce BWLer 🎳, a new architecture for physics-informed learning achieving (near-)machine-precision (up to 10⁻¹² RMSE) on benchmark PDEs. 🧵 How it works: