Mayee Chen (@mayeechen)'s Twitter Profile
Mayee Chen

@mayeechen

CS PhD student @StanfordAILab @HazyResearch, undergrad @princeton. Working on all things data! she/her 🎃

ID: 1227540343177928704

Website: http://mayeechen.github.io · Joined: 12-02-2020 10:30:09

433 Tweets

1.1K Followers

575 Following

Albert Ge (@albert_ge_95)

Online data mixing reduces training costs for foundation models, but faces challenges:
⚠️ Human-defined domains miss semantic nuances
⚠️ Limited eval accessibility
⚠️ Poor scalability

Introducing 🎵R&B: first regroup data, then dynamically reweight domains during training!
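
A minimal sketch of the regroup-then-reweight idea as described above, assuming k-means over precomputed text embeddings for the regrouping and a loss-proportional softmax for the reweighting; both choices are illustrative assumptions, not necessarily R&B's actual algorithm.

```python
# Sketch of regroup-then-reweight online data mixing (illustrative only).
import numpy as np
from sklearn.cluster import KMeans

def regroup(embeddings: np.ndarray, n_domains: int = 8) -> np.ndarray:
    """Step 1: replace human-defined domains with semantic clusters."""
    return KMeans(n_clusters=n_domains, n_init=10).fit_predict(embeddings)

def reweight(domain_losses: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Step 2, run periodically during training: upweight domains whose
    current loss is high, so the sampler focuses where the model is weak."""
    logits = domain_losses / temperature
    logits = logits - logits.max()  # numerical stability
    w = np.exp(logits)
    return w / w.sum()

# Training would then sample each batch across domains according to
# reweight(per_domain_loss_at_current_step).
```
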
Ludwig Schmidt (@lschmidt3)

Very excited to finally release our paper for OpenThoughts!

After DataComp and DCLM, this is the third large open dataset my group has been building in collaboration with the DataComp community. This time, the focus is on post-training, specifically reasoning data.
Sabri Eyuboglu (@eyuboglusabri)

When we put lots of text (eg a code repo) into LLM context, cost soars b/c of the KV cache’s size.

What if we trained a smaller KV cache for our documents offline? Using a test-time training recipe we call self-study, we find that this can reduce cache memory on avg 39x
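
A hedged sketch of the idea, assuming an older Hugging Face-style interface where `past_key_values` is a per-layer tuple of (key, value) tensors (recent versions use Cache objects and would need adapting). The random document spans below are a crude stand-in for self-study's synthesized queries; nothing here is the paper's exact recipe.

```python
# Sketch: distill a long document's KV cache into a small trainable one.
import torch
import torch.nn.functional as F

def distill_cache(model, doc_ids, n_slots=256, steps=500, lr=1e-2):
    cfg = model.config
    head_dim = cfg.hidden_size // cfg.num_attention_heads
    # One trainable (key, value) pair per layer; n_slots << document length.
    small = [torch.nn.Parameter(0.02 * torch.randn(
                 2, 1, cfg.num_attention_heads, n_slots, head_dim))
             for _ in range(cfg.num_hidden_layers)]
    opt = torch.optim.Adam(small, lr=lr)
    with torch.no_grad():  # teacher context: the full-document cache
        full = model(doc_ids, use_cache=True).past_key_values
    for _ in range(steps):
        i = torch.randint(0, doc_ids.shape[1] - 32, (1,)).item()
        query = doc_ids[:, i:i + 32]  # stand-in for a self-study query
        with torch.no_grad():
            teacher = model(query, past_key_values=full).logits
        student = model(query,
                        past_key_values=[(p[0], p[1]) for p in small]).logits
        # Match the full-cache model's next-token distribution.
        loss = F.kl_div(student.log_softmax(-1), teacher.softmax(-1),
                        reduction="batchmean")
        opt.zero_grad(); loss.backward(); opt.step()
    return [(p[0].detach(), p[1].detach()) for p in small]
```
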
Hermann (@kumbonghermann)

Excited to be presenting our new work, HMAR: Efficient Hierarchical Masked Auto-Regressive Image Generation, at #CVPR2025 this week.

VAR (Visual Autoregressive Modelling) introduced a very nice way to formulate autoregressive image generation as a next-scale prediction task (from
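
For context, a rough sketch of what next-scale prediction looks like as a generation loop; `predict_scale` is a hypothetical method standing in for one parallel prediction step, not HMAR's or VAR's actual API.

```python
# Coarse-to-fine next-scale generation (conceptual sketch only).
import torch

def generate_coarse_to_fine(model, scales=(1, 2, 4, 8, 16)):
    """Each resolution's token map is predicted conditioned on all
    coarser maps generated so far, rather than token by token."""
    generated = []
    for s in scales:
        context = (torch.cat([t.flatten(1) for t in generated], dim=1)
                   if generated else torch.zeros(1, 0, dtype=torch.long))
        # predict_scale is hypothetical: one parallel next-scale step.
        generated.append(model.predict_scale(context, size=(s, s)))
    return generated[-1]  # finest-scale token map, decoded by a VQ decoder
```
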
Ai2 (@allen_ai)

We are #1 on the Hugging Face heatmap - this is what true openness looks like! 🥇🎉
750+ models
230+ datasets
And counting...

Come build with us: huggingface.co/spaces/cfahlgr…

Thao Nguyen (@thao_nguyen26)

Web data, the “fossil fuel of AI”, is being exhausted. What’s next?🤔
We propose Recycling the Web to break the data wall of pretraining via grounded synthetic data. It is more effective than standard data filtering methods, even with multi-epoch repeats!

arxiv.org/abs/2506.04689
Tanishq Mathew Abraham, Ph.D. (@iscienceluvr)

Shrinking the Generation-Verification Gap with Weak Verifiers

"we introduce Weaver, a framework for designing a strong verifier by combining multiple weak, imperfect verifiers."

"Weaver leverages weak supervision to estimate each verifier’s accuracy and combines their outputs
Nouha Dziri (@nouhadziri)

📢 Can LLMs really reason outside the box in math? Or are they just remixing familiar strategies? 

Remember how DeepSeek R1 and o1 impressed us on Olympiad-level math while still failing at simple arithmetic 😬

We built a benchmark to find out → OMEGA Ω 📐

💥 We found
Jon Saad-Falcon (@jonsaadfalcon)

How can we close the generation-verification gap when LLMs produce correct answers but fail to select them? 
🧵 Introducing Weaver: a framework that combines multiple weak verifiers (reward models + LM judges) to achieve o3-mini-level accuracy with much cheaper non-reasoning
Kelly Buchanan (@ekellbuch)

LLMs can generate 100 answers, but which one is right? Check out our latest work closing the generation-verification gap by aggregating weak verifiers and distilling them into a compact 400M model. If this direction is exciting to you, we’d love to connect.

Dan Biderman (@dan_biderman)

This important paper proposes:
1. Sample many candidate LLM generations
2. Score each candidate with a collection of imperfect verifiers
3. Fuse these imperfect signals into a single latent score (weak-supervision tricks)
4. Pick the candidate with the highest score

Congrats
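
A toy version of steps 3-4, using mean pairwise agreement as a label-free stand-in for Weaver's weak-supervision accuracy estimator (steps 1-2, sampling and scoring, happen upstream).

```python
import numpy as np

def select_best(votes: np.ndarray) -> int:
    """votes[i, j] = 1 if verifier j accepts candidate i, else 0."""
    n_cand, n_ver = votes.shape
    # Step 3: estimate each verifier's reliability without labels via its
    # average agreement with the other verifiers (a crude proxy).
    agree = (votes[:, :, None] == votes[:, None, :]).mean(axis=0)  # (V, V)
    reliability = (agree.sum(axis=1) - 1.0) / (n_ver - 1.0)
    scores = votes @ reliability  # fuse votes into one score per candidate
    return int(scores.argmax())   # step 4: keep the top-scoring candidate

# Toy run: 4 candidates, 3 verifiers; the unanimously accepted one wins.
votes = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0]])
print(select_best(votes))  # -> 1
```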

Alex Ratner (@ajratner)

Very exciting work on using weak supervision for RL, closing the “generation-verification gap”!! Once again, principled approaches to labeling/data development are the key!

Christopher Agia (@agiachris)

What makes data “good” for robot learning? We argue: it’s the data that drives closed-loop policy success! Introducing CUPID 💘, a method that curates demonstrations not by "quality" or appearance, but by how they influence policy behavior, using influence functions. (1/6)
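
A first-order sketch of influence-style curation, assuming a differentiable success proxy `loss_fn`: each demo is scored by how well its gradient aligns with the held-out objective's gradient. Full influence functions also include a Hessian term; this simplification is not CUPID's actual estimator.

```python
import torch

def influence_scores(model, demos, eval_batch, loss_fn):
    """Score demos by gradient alignment with a held-out success objective."""
    params = [p for p in model.parameters() if p.requires_grad]
    g_eval = torch.autograd.grad(loss_fn(model, eval_batch), params)
    scores = []
    for demo in demos:
        g_demo = torch.autograd.grad(loss_fn(model, demo), params)
        # Positive alignment: training on this demo moves the policy toward
        # higher closed-loop success under the proxy objective.
        scores.append(sum((a * b).sum() for a, b in zip(g_eval, g_demo)).item())
    return scores  # curate by keeping the top-scoring demonstrations
```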

Unseen Japan (@unseenjapansite)

Not Japan-related, but since we all need a distraction from The Horrors, Takaya Suzuki points out a study that examined 408 sleeping cats and found the majority (65%) curl leftwards. 

I'm not sure how useful this information is, but...it's yours now.
Azalia Mirhoseini (@azaliamirh)

Introducing Weaver, a test time scaling method for verification! 

Weaver shrinks the generation-verification gap through a low-overhead weak-to-strong optimization of a mixture of verifiers (e.g., LM judges and reward models). The Weavered mixture can be distilled into a tiny
Ai2 (@allen_ai)

Introducing SciArena, a platform for benchmarking models across scientific literature tasks. Inspired by Chatbot Arena, SciArena applies a crowdsourced LLM evaluation approach to the scientific domain. 🧵
Valentina Pyatkin (@valentina__py)

💡Beyond math/code, instruction following with verifiable constraints is well suited to learning with RLVR.
But the set of constraints and verifier functions is limited and most models overfit on IFEval.
We introduce IFBench to measure model generalization to unseen constraints.
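
To make "verifiable constraints" concrete, here are toy verifier functions of the kind RLVR can reward against; they are illustrative examples only, not IFBench's constraint set.

```python
import re

def max_words(response: str, n: int) -> bool:
    """Constraint: response must not exceed n words."""
    return len(response.split()) <= n

def has_n_bullets(response: str, n: int) -> bool:
    """Constraint: response must contain exactly n bullet points."""
    return len(re.findall(r"^\s*[-*] ", response, flags=re.M)) == n

def avoids_word(response: str, word: str) -> bool:
    """Constraint: response must never use a forbidden word."""
    return word.lower() not in response.lower()

# A task's reward is then the conjunction of its constraint checks,
# giving RLVR a binary, programmatically verifiable signal.
```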