Jon Saad-Falcon (@jonsaadfalcon)'s Twitter Profile
Jon Saad-Falcon

@jonsaadfalcon

AI PhD @hazyresearch @StanfordAILab | Previously @databricks @allen_ai @GeorgiaTech

ID: 1346593521059426304

Website: https://jonsaadfalcon.com/

Joined: 05-01-2021 23:04:54

124 Tweets

824 Followers

453 Following

Sabri Eyuboglu (@eyuboglusabri)

When we put lots of text (eg a code repo) into LLM context, cost soars b/c of the KV cache’s size. What if we trained a smaller KV cache for our documents offline? Using a test-time training recipe we call self-study, we find that this can reduce cache memory on avg 39x

Rylan Schaeffer (@rylanschaeffer)

🚨New preprint 🚨 Turning Down the Heat: A Critical Analysis of Min-p Sampling in Language Models We examine min-p sampling (ICLR 2025 oral) & find significant problems in all 4 lines of evidence: human eval, NLP evals, LLM-as-judge evals, community adoption claims 1/8

William Berrios (@w33lliam)

Excited to share 🤯 that our LMUnit models with Contextual AI just claimed the top spots on RewardBench2 🥇 How did we manage to rank +5% higher than models like Gemini, Claude 4, and GPT4.1? More in the details below: 🧵 1/11

Tanishq Mathew Abraham, Ph.D. (@iscienceluvr)

Shrinking the Generation-Verification Gap with Weak Verifiers "we introduce Weaver, a framework for designing a strong verifier by combining multiple weak, imperfect verifiers." "Weaver leverages weak supervision to estimate each verifier’s accuracy and combines their outputs

Kelly Buchanan (@ekellbuch)

LLMs can generate 100 answers, but which one is right? Check out our latest work closing the generation-verification gap by aggregating weak verifiers and distilling them into a compact 400M model. If this direction is exciting to you, we’d love to connect.

Alex Ratner (@ajratner)

Very exciting work on using weak supervision for RL- closing the “generation-verification gap”!! Once again- principled approaches to labeling/data development are the keys!

Mayee Chen (@mayeechen)

LLMs often generate correct answers but struggle to select them. Weaver tackles this by combining many weak verifiers (reward models, LM judges) into a stronger signal using statistical tools from Weak Supervision—matching o3-mini-level accuracy with much cheaper models! 📊

Leonard Tang (@leonardtang_)

Verdict systems can now judge image inputs. Score product photos. Ad creatives. UI mockups. Haize anime birds. Judge any thing for any quality—and understand why.

Azalia Mirhoseini (@azaliamirh)

Introducing Weaver, a test time scaling method for verification! Weaver shrinks the generation-verification gap through a low-overhead weak-to-strong optimization of a mixture of verifiers (e.g., LM judges and reward models). The Weavered mixture can be distilled into a tiny

Azalia Mirhoseini (@azaliamirh)

See Jon Saad-Falcon's post for more details: x.com/JonSaadFalcon/… Paper: arxiv.org/abs/2506.18203 Blog: hazyresearch.stanford.edu/blog/2025-06-1… github.com/HazyResearch/s… Datasets and Models: huggingface.co/collections/ha…