Jon Saad-Falcon (@jonsaadfalcon) Twitter Tweets • TwiCopy

Jon Saad-Falcon

@jonsaadfalcon

AI PhD @hazyresearch @StanfordAILab | Previously @databricks @allen_ai @GeorgiaTech

ID: 1346593521059426304

Link: https://jonsaadfalcon.com/

Joined: 05-01-2021 23:04:54

124 Tweets

824 Followers

453 Following

Sabri Eyuboglu

@eyuboglusabri

5 months ago

When we put lots of text (eg a code repo) into LLM context, cost soars b/c of the KV cache’s size. What if we trained a smaller KV cache for our documents offline? Using a test-time training recipe we call self-study, we find that this can reduce cache memory on avg 39x

287 Likes

12 Replies

66 Reposts
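
The cost claim in the post above can be made concrete with back-of-the-envelope arithmetic. The sketch below uses the standard KV cache size formula (keys and values, per layer, per KV head, per token). The model shape and context length are hypothetical values chosen only for illustration, and the 39x factor is the average reduction quoted in the post, not something this code measures.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Approximate KV cache size in bytes for a single sequence.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_value=2 assumes 16-bit (fp16/bf16) cache entries.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model shape and context length (assumptions, not from the post).
full_bytes = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=100_000)

# Apply the average 39x reduction quoted in the post.
compressed_bytes = full_bytes / 39

print(f"full cache:       {full_bytes / 1e9:.2f} GB")
print(f"after 39x smaller: {compressed_bytes / 1e9:.2f} GB")
```

Because the cache grows linearly with `seq_len`, putting a whole code repository in context scales memory directly with the number of tokens, which is why a smaller trained cache matters.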

Rylan Schaeffer

@rylanschaeffer

5 months ago

🚨 New preprint 🚨 Turning Down the Heat: A Critical Analysis of Min-p Sampling in Language Models. We examine min-p sampling (ICLR 2025 oral) & find significant problems in all 4 lines of evidence: human eval, NLP evals, LLM-as-judge evals, community adoption claims 1/8

285 Likes

12 Replies

35 Reposts
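
For readers unfamiliar with the technique the preprint examines, here is a minimal sketch of min-p filtering as it is commonly described: tokens whose probability falls below a fraction `p_base` of the most likely token's probability are discarded, and the rest are renormalized. The function name, the default `p_base`, and the example distribution are illustrative assumptions; this is not code from the preprint or from the original min-p paper.

```python
def min_p_filter(probs, p_base=0.1):
    """Zero out tokens below p_base * max(probs), then renormalize.

    probs: list of token probabilities that sum to 1.
    p_base: relative cutoff in (0, 1]; an illustrative default.
    """
    threshold = p_base * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)  # always > 0 because the top token is kept
    return [p / total for p in kept]


# Illustrative next-token distribution (made up for this example).
filtered = min_p_filter([0.5, 0.3, 0.15, 0.05], p_base=0.2)
print(filtered)
```

The cutoff scales with the model's confidence: a peaked distribution prunes aggressively, while a flat one keeps more candidates.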

William Berrios

@w33lliam

5 months ago

Excited to share 🤯 that our LMUnit models with Contextual AI just claimed the top spots on RewardBench2 🥇 How did we manage to rank +5% higher than models like Gemini, Claude 4, and GPT4.1? More in the details below: 🧵 1/11

115 Likes

6 Replies

22 Reposts

Tanishq Mathew Abraham, Ph.D.

@iscienceluvr

5 months ago

Shrinking the Generation-Verification Gap with Weak Verifiers "we introduce Weaver, a framework for designing a strong verifier by combining multiple weak, imperfect verifiers." "Weaver leverages weak supervision to estimate each verifier’s accuracy and combines their outputs

124 Likes

3 Replies

24 Reposts

Kelly Buchanan

@ekellbuch

5 months ago

LLMs can generate 100 answers, but which one is right? Check out our latest work closing the generation-verification gap by aggregating weak verifiers and distilling them into a compact 400M model. If this direction is exciting to you, we’d love to connect.

44 Likes

1 Reply

14 Reposts

Dan Fu

@realdanfu

5 months ago

What a throwback to weak supervision! Great work Jon Saad-Falcon, Kelly Buchanan, and Mayee Chen!

24 Likes

1 Reply

7 Reposts

Alex Ratner

@ajratner

5 months ago

Very exciting work on using weak supervision for RL: closing the “generation-verification gap”!! Once again, principled approaches to labeling/data development are the keys!

20 Likes

1 Reply

7 Reposts

Mayee Chen

@mayeechen

5 months ago

LLMs often generate correct answers but struggle to select them. Weaver tackles this by combining many weak verifiers (reward models, LM judges) into a stronger signal using statistical tools from Weak Supervision—matching o3-mini-level accuracy with much cheaper models! 📊

223 Likes

15 Replies

33 Reposts
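
The idea of combining weak verifiers into a stronger signal can be illustrated with a simple accuracy-weighted vote. This is only a sketch under strong assumptions, not the Weaver method: the verifier accuracies are supplied by hand here (the posts say Weaver estimates them with weak supervision), verifiers are assumed independent with binary verdicts, and all names and numbers are hypothetical.

```python
import math


def combined_score(verdicts, accuracies):
    """Log-odds that a candidate is correct under a naive Bayes model.

    verdicts: 0/1 votes, one per verifier (1 = "looks correct").
    accuracies: assumed accuracy of each verifier, strictly between 0.5 and 1.
    """
    score = 0.0
    for vote, acc in zip(verdicts, accuracies):
        weight = math.log(acc / (1.0 - acc))  # more accurate verifiers count more
        score += weight if vote == 1 else -weight
    return score


def select_best(candidates, verdict_matrix, accuracies):
    """Return the candidate whose verdicts give the highest combined score."""
    scores = [combined_score(row, accuracies) for row in verdict_matrix]
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index]


# Hypothetical data: one strong verifier disagrees with two weaker ones.
accuracies = [0.9, 0.6, 0.6]
verdict_matrix = [
    [1, 0, 0],  # verdicts for candidate "A"
    [0, 1, 1],  # verdicts for candidate "B"
]
chosen = select_best(["A", "B"], verdict_matrix, accuracies)
print(chosen)
```

In this example a plain majority vote would pick "B", while the weighted combination picks "A" because the single high-accuracy verifier outweighs the two weaker ones.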

Leonard Tang

@leonardtang_

5 months ago

Verdict systems can now judge image inputs. Score product photos. Ad creatives. UI mockups. Haize anime birds. Judge any thing for any quality—and understand why.

44 Likes

3 Replies

10 Reposts

Azalia Mirhoseini

@azaliamirh

5 months ago

Introducing Weaver, a test time scaling method for verification! Weaver shrinks the generation-verification gap through a low-overhead weak-to-strong optimization of a mixture of verifiers (e.g., LM judges and reward models). The Weavered mixture can be distilled into a tiny

169 Likes

2 Replies

35 Reposts

Azalia Mirhoseini

@azaliamirh

5 months ago

See Jon Saad-Falcon's post for more details: x.com/JonSaadFalcon/…
Paper: arxiv.org/abs/2506.18203
Blog: hazyresearch.stanford.edu/blog/2025-06-1…
github.com/HazyResearch/s……
Datasets and Models: huggingface.co/collections/ha…

12 Likes

0 Replies

3 Reposts