Simran Arora (@simran_s_arora)'s Twitter Profile

Simran Arora

@simran_s_arora

cs @StanfordAILab @hazyresearch

ID: 4712264894

https://arorasimran.com/ · Joined 05-01-2016 06:18:44

311 Tweets

3.3K Followers

193 Following

Neel Guha (@neelguha):

Really cool work led by Sabri Eyuboglu, Ryan Ehrlich, and Simran Arora! The idea of training a "cartridge" that represents the knowledge in a document (or corpus) and can be slotted into LLMs to support engagement has tons of applications/practical importance for law (1/4)

Ryan Ehrlich (@ryansehrlich):

Giving LLMs very large amounts of context can be really useful, but it can also be slow and expensive. Could scaling inference-time compute help? In our latest work, we show that allowing models to spend test-time compute to “self-study” a large corpus can >20x decode…

Simran Arora (@simran_s_arora):

There’s been tons of work on KV-cache compression and on KV-cache-free Transformer alternatives (SSMs, linear attention) for long context, but we know there’s no free lunch with these methods. The quality-memory tradeoffs are annoying. *Is all lost?* Introducing CARTRIDGES:

Simran Arora (@simran_s_arora):

Check out CARTRIDGES, scaling cache-time compute! An alternative to ICL for settings where many different user messages reference the same large corpus of text!

gm8xx8 (@gm8xx8):

Cartridges: Storing long contexts in tiny caches with self-study

- train-once, reusable memory via SELF-STUDY
- 38.6× less memory, 26.4× higher throughput
- extends context to 484k, composes across corpora
- outperforms LoRA, DuoAttention, and standard ICL

BLOG:
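To make these claims concrete: here is a minimal sketch of what a "cartridge" could look like as a trainable KV prefix. The class name, shapes, and init scale are illustrative assumptions, not the authors' actual implementation; it only assumes a decoder whose cache is a per-layer tuple of key/value tensors.

```python
import torch

class Cartridge(torch.nn.Module):
    """Hypothetical trainable KV prefix: a small, fixed-size stand-in
    for the KV cache that a long corpus prompt would normally produce."""

    def __init__(self, n_layers: int, n_kv_heads: int, n_tokens: int, head_dim: int):
        super().__init__()
        # One trainable (key, value) pair per layer, shaped like a cache
        # for `n_tokens` virtual tokens: (batch, heads, tokens, head_dim).
        shape = (1, n_kv_heads, n_tokens, head_dim)
        self.keys = torch.nn.ParameterList(
            [torch.nn.Parameter(0.02 * torch.randn(shape)) for _ in range(n_layers)]
        )
        self.values = torch.nn.ParameterList(
            [torch.nn.Parameter(0.02 * torch.randn(shape)) for _ in range(n_layers)]
        )

    def as_past_key_values(self):
        # Legacy Hugging Face cache layout: a (key, value) pair per layer.
        return tuple((k, v) for k, v in zip(self.keys, self.values))
```

At serving time, the cartridge would be slotted in wherever the model accepts a precomputed cache (e.g. a `past_key_values`-style argument), so every request over the same corpus reuses the same small prefix instead of re-encoding the full text; that is where the memory and throughput wins above come from.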
Azalia Mirhoseini (@azaliamirh):

Very excited to share this new approach to long-context LLMs!! (matching ICL quality, but with 39x less KV cache memory and 26x higher peak throughput)

The recipe: trade scaling offline inference-compute on the long context (via “self-study”) for compressed KV-cache memory (aka…
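As a rough sketch of that recipe, assuming the `Cartridge` sketch above and a frozen Hugging Face-style causal LM (the KL loss and helper names here are illustrative assumptions, not the paper's actual objective):

```python
import torch
import torch.nn.functional as F

def self_study_step(model, cartridge, question_ids, teacher_logits, optimizer):
    """One hypothetical 'self-study' update: nudge the model's next-token
    distribution *with only the cartridge* toward the distribution a
    teacher pass produced earlier *with the full corpus in context*."""
    out = model(input_ids=question_ids,
                past_key_values=cartridge.as_past_key_values())
    loss = F.kl_div(F.log_softmax(out.logits, dim=-1),
                    F.softmax(teacher_logits, dim=-1),
                    reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()   # gradients flow only into the cartridge's K/V parameters
    optimizer.step()
    return loss.item()
```

Repeating this offline over many self-generated questions about the corpus is the "scaling offline inference-compute" half of the trade; the model's weights stay frozen, so one trained cartridge can then serve every future request.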
Agent B (@michelivan92347):

Cartridges = an interesting offline alternative to regular ICL for frequently used large text corpora. 👇

A lot to learn in this awesome work imo.
(another one from a Hazy Research team)

Bravo to the team 👏
Teortaxes▶️ (DeepSeek Twitter 🐋 die-hard fan 2023 – ∞) (@teortaxestex):

I like this idea very much and have long advocated for something like this. Synthetically enriched «KV prefix» is a natural augment to modern long context models.

Karan Goel (@krandiash):

Today we shipped a new real-time API for streaming speech-to-text (a new family of models called Ink) that’s extremely fast, cheap, and designed specifically for voice agents. We’re cooking hard, lots more releases coming soon 🧑‍🍳

Kawin Ethayarajh (@ethayarajh):

Trading online compute for offline compute is an under-discussed axis of scaling, but one that will be increasingly relevant going forward.

Charles Foster (@cfgeek):

Looks like a very slick way to tune and cheaply serve custom models!

If I were building on this, I’d try to find a better way to initialize the cache. You can initialize LoRA as a no-op and let backprop handle the rest, but KV-tuning methods need weird initialization hacks.
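The LoRA contrast is concrete: because the low-rank update is B @ A, zero-initializing B makes the adapter an exact no-op at step 0, whereas even an all-zero KV cache still perturbs attention, so there is no equally clean "do nothing" start. A minimal sketch of the LoRA side (standard construction, simplified):

```python
import torch

class LoRALinear(torch.nn.Module):
    """LoRA wrapper: y = base(x) + (B @ A) x. With B = 0 at init, the
    adapter contributes nothing until backprop moves it off zero."""

    def __init__(self, base: torch.nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base  # frozen pretrained layer
        self.A = torch.nn.Parameter(0.01 * torch.randn(rank, base.in_features))
        self.B = torch.nn.Parameter(torch.zeros(base.out_features, rank))  # no-op init

    def forward(self, x):
        return self.base(x) + x @ self.A.T @ self.B.T
```

A KV-tuning method, by contrast, presumably has to start from something meaningful, e.g. the cache produced by real tokens from the corpus, which is the kind of initialization hack being alluded to.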
Jeremy Howard (@jeremyphoward):

Claude not able to continue my research chat about context compression papers because it ran out of context because it doesn't use context compression.

Cartesia (@cartesia_ai):

👑 We’re #1! Sonic-2 leads @Labelbox’s Speech Generation Leaderboard, topping out in speech quality, word error rate, and naturalness. Build your real-time voice apps with the 🥇 best voice AI model. ➡️ labelbox.com/leaderboards/s…

Infini-AI-Lab (@infiniailab):

🔥 We introduce Multiverse, a new generative modeling framework for adaptive and lossless parallel generation. 🚀 Multiverse is the first open-source non-AR model to achieve AIME24 and AIME25 scores of 54% and 46% 🌐 Website: multiverse4fm.github.io 🧵 1/n

Siddharth Karamcheti (@siddkaramcheti):

Thrilled to share that I'll be starting as an Assistant Professor at Georgia Tech (Georgia Tech School of Interactive Computing / Robotics@GT / Machine Learning at Georgia Tech) in Fall 2026. My lab will tackle problems in robot learning, multimodal ML, and interaction. I'm recruiting PhD students this next cycle – please apply/reach out!

Jon Saad-Falcon (@jonsaadfalcon):

How can we close the generation-verification gap when LLMs produce correct answers but fail to select them?

🧵 Introducing Weaver: a framework that combines multiple weak verifiers (reward models + LM judges) to achieve o3-mini-level accuracy with much cheaper non-reasoning…
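A toy sketch of the weak-verifier idea (the aggregation here is a plain weighted average for illustration; Weaver's actual combination of verifiers is more sophisticated):

```python
import numpy as np

def select_answer(candidates, verifiers, weights):
    """Score every candidate with every verifier, combine the scores,
    and keep the candidate with the best combined score."""
    scores = np.array([[verifier(c) for verifier in verifiers] for c in candidates])
    combined = scores @ np.asarray(weights, dtype=float)  # one score per candidate
    return candidates[int(np.argmax(combined))]

# Toy usage with stand-in verifiers that score a string answer in [0, 1]:
answers = ["42", "forty-two", "unsure"]
verifiers = [lambda a: float(a.isdigit()), lambda a: 1.0 / (1 + len(a))]
print(select_answer(answers, verifiers, weights=[0.7, 0.3]))  # -> "42"
```

The point of combining verifiers is that each one is individually weak (noisy reward models, imperfect LM judges), but their errors are partly independent, so an aggregate score selects correct answers more reliably than any single verifier.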
Sanjana Srivastava (@sanjana__z):

🤖 Household robots are becoming physically viable. But interacting with people in the home requires handling unseen, unconstrained, dynamic preferences, not just a complex physical domain. We introduce ROSETTA: a method to generate rewards for such preferences cheaply. 🧵⬇️

Jerry Liu (@jerrywliu):

1/10 ML can solve PDEs – but precision 🔬 is still a challenge. Towards high-precision methods for scientific problems, we introduce BWLer 🎳, a new architecture for physics-informed learning achieving (near-)machine-precision (up to 10⁻¹² RMSE) on benchmark PDEs. 🧵 How it works: