Mayee Chen (@mayeechen)'s Twitter Profile
Mayee Chen

@mayeechen

CS PhD student @StanfordAILab @HazyResearch, undergrad @princeton. Working on all things data! she/her 🎃

ID: 1227540343177928704

Website: http://mayeechen.github.io · Joined: 12-02-2020 10:30:09

433 Tweets

1.1K Followers

575 Following

Albert Ge (@albert_ge_95)

Online data mixing reduces training costs for foundation models, but faces challenges:
⚠️ Human-defined domains miss semantic nuances
⚠️ Limited eval accessibility
⚠️ Poor scalability

Introducing 🎵R&B: first regroup data, then dynamically reweight domains during training!
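
A minimal sketch of the regroup-then-reweight idea as described above, assuming k-means over precomputed text embeddings for the regrouping and a loss-proportional softmax for the reweighting; both choices are illustrative assumptions, not necessarily R&B's actual algorithm.

```python
# Sketch of regroup-then-reweight online data mixing (illustrative only).
import numpy as np
from sklearn.cluster import KMeans

def regroup(embeddings: np.ndarray, n_domains: int = 8) -> np.ndarray:
    """Step 1: replace human-defined domains with semantic clusters."""
    return KMeans(n_clusters=n_domains, n_init=10).fit_predict(embeddings)

def reweight(domain_losses: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Step 2, run periodically during training: upweight domains whose
    current loss is high, so the sampler focuses where the model is weak."""
    logits = domain_losses / temperature
    logits = logits - logits.max()  # numerical stability
    w = np.exp(logits)
    return w / w.sum()

# Training would then sample each batch across domains according to
# reweight(per_domain_loss_at_current_step).
```
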
Ludwig Schmidt (@lschmidt3)

Very excited to finally release our paper for OpenThoughts!

After DataComp and DCLM, this is the third large open dataset my group has been building in collaboration with the DataComp community. This time, the focus is on post-training, specifically reasoning data.
Sabri Eyuboglu (@eyuboglusabri)

When we put lots of text (eg a code repo) into LLM context, cost soars b/c of the KV cache’s size.

What if we trained a smaller KV cache for our documents offline? Using a test-time training recipe we call self-study, we find that this can reduce cache memory on avg 39x
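
A hedged sketch of the idea, assuming an older Hugging Face-style interface where `past_key_values` is a per-layer tuple of (key, value) tensors (recent versions use Cache objects and would need adapting). The random document spans below are a crude stand-in for self-study's synthesized queries; nothing here is the paper's exact recipe.

```python
# Sketch: distill a long document's KV cache into a small trainable one.
import torch
import torch.nn.functional as F

def distill_cache(model, doc_ids, n_slots=256, steps=500, lr=1e-2):
    cfg = model.config
    head_dim = cfg.hidden_size // cfg.num_attention_heads
    # One trainable (key, value) pair per layer; n_slots << document length.
    small = [torch.nn.Parameter(0.02 * torch.randn(
                 2, 1, cfg.num_attention_heads, n_slots, head_dim))
             for _ in range(cfg.num_hidden_layers)]
    opt = torch.optim.Adam(small, lr=lr)
    with torch.no_grad():  # teacher context: the full-document cache
        full = model(doc_ids, use_cache=True).past_key_values
    for _ in range(steps):
        i = torch.randint(0, doc_ids.shape[1] - 32, (1,)).item()
        query = doc_ids[:, i:i + 32]  # stand-in for a self-study query
        with torch.no_grad():
            teacher = model(query, past_key_values=full).logits
        student = model(query,
                        past_key_values=[(p[0], p[1]) for p in small]).logits
        # Match the full-cache model's next-token distribution.
        loss = F.kl_div(student.log_softmax(-1), teacher.softmax(-1),
                        reduction="batchmean")
        opt.zero_grad(); loss.backward(); opt.step()
    return [(p[0].detach(), p[1].detach()) for p in small]
```
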
Hermann (@kumbonghermann)

Excited to be presenting our new work, HMAR: Efficient Hierarchical Masked Auto-Regressive Image Generation, at #CVPR2025 this week.

VAR (Visual Autoregressive Modelling) introduced a very nice way to formulate autoregressive image generation as a next-scale prediction task (from
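
For context, a rough sketch of what next-scale prediction looks like as a generation loop; `predict_scale` is a hypothetical method standing in for one parallel prediction step, not HMAR's or VAR's actual API.

```python
# Coarse-to-fine next-scale generation (conceptual sketch only).
import torch

def generate_coarse_to_fine(model, scales=(1, 2, 4, 8, 16)):
    """Each resolution's token map is predicted conditioned on all
    coarser maps generated so far, rather than token by token."""
    generated = []
    for s in scales:
        context = (torch.cat([t.flatten(1) for t in generated], dim=1)
                   if generated else torch.zeros(1, 0, dtype=torch.long))
        # predict_scale is hypothetical: one parallel next-scale step.
        generated.append(model.predict_scale(context, size=(s, s)))
    return generated[-1]  # finest-scale token map, decoded by a VQ decoder
```
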
Ai2 (@allen_ai)

We are #1 on the Hugging Face heatmap - this is what true openness looks like! 🥇🎉
750+ models
230+ datasets
And counting...

Come build with us: huggingface.co/spaces/cfahlgr…

Thao Nguyen (@thao_nguyen26)

Web data, the “fossil fuel of AI”, is being exhausted. What’s next?🤔
We propose Recycling the Web to break the data wall of pretraining via grounded synthetic data. It is more effective than standard data filtering methods, even with multi-epoch repeats!

arxiv.org/abs/2506.04689
Tanishq Mathew Abraham, Ph.D. (@iscienceluvr)

Shrinking the Generation-Verification Gap with Weak Verifiers

"we introduce Weaver, a framework for designing a strong verifier by combining multiple weak, imperfect verifiers."

"Weaver leverages weak supervision to estimate each verifier’s accuracy and combines their outputs
Nouha Dziri (@nouhadziri)

📢 Can LLMs really reason outside the box in math? Or are they just remixing familiar strategies? 

Remember how DeepSeek R1 and o1 impressed us on Olympiad-level math while still failing at simple arithmetic 😬

We built a benchmark to find out → OMEGA Ω 📐

💥 We found
Jon Saad-Falcon (@jonsaadfalcon)

How can we close the generation-verification gap when LLMs produce correct answers but fail to select them? 
🧵 Introducing Weaver: a framework that combines multiple weak verifiers (reward models + LM judges) to achieve o3-mini-level accuracy with much cheaper non-reasoning
Kelly Buchanan (@ekellbuch)

LLMs can generate 100 answers, but which one is right? Check out our latest work closing the generation-verification gap by aggregating weak verifiers and distilling them into a compact 400M model. If this direction is exciting to you, we’d love to connect.

Dan Biderman (@dan_biderman)

This important paper proposes:
1. Sample many candidate LLM generations
2. Score each candidate with a collection of imperfect verifiers
3. Fuse these imperfect signals into a single latent score (weak-supervision tricks)
4. Pick the candidate with the highest score

Congrats
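
A toy version of steps 3-4, using mean pairwise agreement as a label-free stand-in for Weaver's weak-supervision accuracy estimator (steps 1-2, sampling and scoring, happen upstream).

```python
import numpy as np

def select_best(votes: np.ndarray) -> int:
    """votes[i, j] = 1 if verifier j accepts candidate i, else 0."""
    n_cand, n_ver = votes.shape
    # Step 3: estimate each verifier's reliability without labels via its
    # average agreement with the other verifiers (a crude proxy).
    agree = (votes[:, :, None] == votes[:, None, :]).mean(axis=0)  # (V, V)
    reliability = (agree.sum(axis=1) - 1.0) / (n_ver - 1.0)
    scores = votes @ reliability  # fuse votes into one score per candidate
    return int(scores.argmax())   # step 4: keep the top-scoring candidate

# Toy run: 4 candidates, 3 verifiers; the unanimously accepted one wins.
votes = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0]])
print(select_best(votes))  # -> 1
```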

Alex Ratner (@ajratner)

Very exciting work on using weak supervision for RL, closing the “generation-verification gap”!! Once again, principled approaches to labeling/data development are the key!

Christopher Agia (@agiachris)

What makes data “good” for robot learning? We argue: it’s the data that drives closed-loop policy success! Introducing CUPID 💘, a method that curates demonstrations not by "quality" or appearance, but by how they influence policy behavior, using influence functions. (1/6)
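
A first-order sketch of influence-style curation, assuming a differentiable success proxy `loss_fn`: each demo is scored by how well its gradient aligns with the held-out objective's gradient. Full influence functions also include a Hessian term; this simplification is not CUPID's actual estimator.

```python
import torch

def influence_scores(model, demos, eval_batch, loss_fn):
    """Score demos by gradient alignment with a held-out success objective."""
    params = [p for p in model.parameters() if p.requires_grad]
    g_eval = torch.autograd.grad(loss_fn(model, eval_batch), params)
    scores = []
    for demo in demos:
        g_demo = torch.autograd.grad(loss_fn(model, demo), params)
        # Positive alignment: training on this demo moves the policy toward
        # higher closed-loop success under the proxy objective.
        scores.append(sum((a * b).sum() for a, b in zip(g_eval, g_demo)).item())
    return scores  # curate by keeping the top-scoring demonstrations
```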

Unseen Japan (@unseenjapansite)

Not Japan-related, but since we all need a distraction from The Horrors, Takaya Suzuki points out a study that examined 408 sleeping cats and found the majority (65%) curl leftwards. 

I'm not sure how useful this information is, but...it's yours now.
Azalia Mirhoseini (@azaliamirh)

Introducing Weaver, a test time scaling method for verification! 

Weaver shrinks the generation-verification gap through a low-overhead weak-to-strong optimization of a mixture of verifiers (e.g., LM judges and reward models). The Weavered mixture can be distilled into a tiny
Ai2 (@allen_ai)

Introducing SciArena, a platform for benchmarking models across scientific literature tasks. Inspired by Chatbot Arena, SciArena applies a crowdsourced LLM evaluation approach to the scientific domain. 🧵
Valentina Pyatkin (@valentina__py)

💡Beyond math/code, instruction following with verifiable constraints is well suited to learning with RLVR.
But the set of constraints and verifier functions is limited and most models overfit on IFEval.
We introduce IFBench to measure model generalization to unseen constraints.
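
To make "verifiable constraints" concrete, here are toy verifier functions of the kind RLVR can reward against; they are illustrative examples only, not IFBench's constraint set.

```python
import re

def max_words(response: str, n: int) -> bool:
    """Constraint: response must not exceed n words."""
    return len(response.split()) <= n

def has_n_bullets(response: str, n: int) -> bool:
    """Constraint: response must contain exactly n bullet points."""
    return len(re.findall(r"^\s*[-*] ", response, flags=re.M)) == n

def avoids_word(response: str, word: str) -> bool:
    """Constraint: response must never use a forbidden word."""
    return word.lower() not in response.lower()

# A task's reward is then the conjunction of its constraint checks,
# giving RLVR a binary, programmatically verifiable signal.
```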