Wenting Zhao (@wzhao_nlp)'s Twitter Profile
Wenting Zhao

@wzhao_nlp

PhD student @cornell_tech
NLP + AI

ID: 1473829704

Website: https://wenting-zhao.github.io/
Joined: 01-06-2013 05:18:16

343 Tweets

1.1K Followers

495 Following

Graham Neubig (@gneubig) 's Twitter Profile Photo

Where does one language model outperform the other? We examine this from first principles, performing unsupervised discovery of "abilities" that one model has and the other does not. Results show interesting differences between model classes, sizes and pre-/post-training.

Alex Dimakis (@alexgdimakis) 's Twitter Profile Photo

There are still posts about 'new papers showing AI models cannot reason'. Unfortunately, there are problems with how these evaluations were done, and many of those limitations are already known, peer-reviewed, and published. Here is a simplified version of what's going on as far as I…

Wenting Zhao (@wzhao_nlp) 's Twitter Profile Photo

That’s the vision of commit0: github.com/commit-0/commi… There has been nearly zero improvement on this benchmark in the past few months. I don’t think this problem is solvable in 24 months…

Wenting Zhao (@wzhao_nlp) 's Twitter Profile Photo

The more I dive into LM training, the more I feel pretraining is just starting. Some questions I'm particularly interested in:
* what data unlocks what capabilities?
* do we train on capabilities sequentially or in parallel?
* how many synthetic examples is a human example worth?

Wenting Zhao (@wzhao_nlp) 's Twitter Profile Photo

It's time to think about code generation beyond functional correctness. Refactoring multiple libraries requires designing APIs that support past and future use cases, which is challenging even for human engineers. Can't wait for LLMs to unify pytorch, tensorflow, and jax 😬

Wenting Zhao (@wzhao_nlp) 's Twitter Profile Photo

LM training bottlenecks
2024: code RL -> code execution is slower than model inference
2025: reasoning model RL -> rolling out 32k tokens takes forever
maybe diffusion models are indeed the solution lol
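
The 2025 bottleneck is easy to see with rough numbers. A minimal back-of-envelope sketch (every figure below is an assumption for illustration, not a measurement):

```python
# Why 32k-token rollouts dominate RL step time -- all numbers are assumptions.
rollouts_per_batch = 512              # assumed on-policy batch size
tokens_per_rollout = 32_000           # long reasoning-style responses
decode_tok_per_sec_per_gpu = 2_000    # assumed autoregressive decode throughput
num_inference_gpus = 8

total_tokens = rollouts_per_batch * tokens_per_rollout
seconds = total_tokens / (decode_tok_per_sec_per_gpu * num_inference_gpus)
print(f"{total_tokens:,} generated tokens ≈ {seconds / 60:.0f} min per RL step")
# => 16,384,000 generated tokens ≈ 17 min per RL step, before any gradient work
```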

NovaSky (@novaskyai) 's Twitter Profile Photo

✨Release: We upgraded SkyRL into a highly-modular, performant RL framework for training LLMs. We prioritized modularity—easily prototype new algorithms, environments, and training logic with minimal overhead.

🧵👇
Blog: novasky-ai.notion.site/skyrl-v01
Code: github.com/NovaSky-AI/Sky…

Wenting Zhao (@wzhao_nlp) 's Twitter Profile Photo

Dang, truly impressed by how an academic lab just figured out a lot of mysteries in mid-training to close the RL gap between llama and qwen:
* the length scheduler plays a key role in stabilizing RL
* there is some dark magic in the prompt template?
* the data interaction stuff is really…

Jason Wei (@_jasonwei) 's Twitter Profile Photo

We don’t have AI that self-improves yet, and when we do, it will be a game-changer. With more wisdom now compared to the GPT-4 days, it's obvious that it will not be a “fast takeoff”, but rather extremely gradual across many years, probably a decade. The first thing to know is that…

Michael Hu (@michahu8) 's Twitter Profile Photo

📢 today's scaling laws often don't work for predicting downstream task performance. For some pretraining setups, smooth and predictable scaling is the exception, not the rule.

a quick read about scaling law fails: 📜arxiv.org/abs/2507.00885 🧵1/5👇
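
To make "scaling law prediction" concrete, here is a minimal, hypothetical sketch of the standard recipe the thread is pushing back on: fit a power law to small-scale runs, then extrapolate. The run sizes, losses, and fitted values below are made up for illustration and are not from the paper.

```python
# Fit L(N) = a * N^(-b) + e to small models, then extrapolate to a larger one.
# The thread's point: this works far more often for loss than for downstream accuracy.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, e):
    # irreducible loss e plus a power-law term in model size n (billions of params)
    return a * np.power(n, -b) + e

model_size_b = np.array([0.1, 0.5, 1.0, 7.0, 70.0])   # hypothetical runs
val_loss = np.array([4.08, 3.42, 3.20, 2.72, 2.34])   # hypothetical losses

(a, b, e), _ = curve_fit(power_law, model_size_b, val_loss, p0=[1.0, 0.1, 2.0])
print(f"fit: a={a:.2f}, b={b:.2f}, e={e:.2f}")
print(f"extrapolated loss at 400B params: {power_law(400.0, a, b, e):.2f}")
```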

Ori Press (@ori_press) 's Twitter Profile Photo

Do language models have algorithmic creativity? To find out, we built AlgoTune, a benchmark challenging agents to optimize 100+ algorithms like gzip compression, AES encryption and PCA. Frontier models struggle, finding only surface-level wins. Lots of headroom here!🧵⬇️
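
For a sense of what such tasks look like, here is a toy example in the same spirit (it is not an actual AlgoTune task): the agent gets a slow reference implementation and must produce a faster one that returns the same answer. The "surface-level win" here is just vectorizing a Python loop with NumPy.

```python
import numpy as np

def pairwise_dists_reference(x: np.ndarray) -> np.ndarray:
    """Slow reference: explicit O(n^2) Python loops over point pairs."""
    n = x.shape[0]
    out = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            out[i, j] = np.sqrt(np.sum((x[i] - x[j]) ** 2))
    return out

def pairwise_dists_optimized(x: np.ndarray) -> np.ndarray:
    """Candidate solution: same result via broadcasting, no Python-level loops."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

x = np.random.default_rng(0).normal(size=(200, 16))
# correctness is a hard constraint; the speedup is what gets scored
assert np.allclose(pairwise_dists_reference(x), pairwise_dists_optimized(x))
```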

Yoram Bachrach (@yorambac) 's Twitter Profile Photo

AI Research Agents are becoming proficient at machine learning tasks, but how can we help them search the space of candidate solutions and codebases? Read our new paper looking at MLE-Bench: arxiv.org/pdf/2507.02554 #LLM #Agents #MLEBench

Wenting Zhao (@wzhao_nlp) 's Twitter Profile Photo

I'll be around the ICML venue this afternoon. Message me if you want to meet! These days, I think about reasoning and RL. Also happy to talk about academia vs. industry (I think the lack of compute in academia is a feature, not a bug), and about faculty and PhD student recruiting at UMass.

Eric Zelikman (@ericzelikman) 's Twitter Profile Photo

i've been thinking lately about how future ai systems will interact with us and how we can make systems that care about people and wanted to put words to it -- hopefully it resonates a bit!

Yuntian Deng (@yuntiandeng) 's Twitter Profile Photo

🚀New dataset release: WildChat-4.8M

4.8M real user-ChatGPT conversations collected from our public chatbots:
- 122K from reasoning models (o1-preview, o1-mini): represent real uses in the wild and very costly to collect
- 2.5M from GPT-4o

🔗 hf.co/datasets/allen… (1/4)
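
A minimal sketch of how one might stream such a release with the Hugging Face datasets library. The repository ID below is a placeholder because the link in the tweet is truncated, and the field names follow earlier WildChat releases, so both may differ for WildChat-4.8M.

```python
from datasets import load_dataset

# placeholder repo ID -- check the announcement link for the exact dataset name
ds = load_dataset("allenai/WildChat-4.8M", split="train", streaming=True)

for example in ds.take(3):
    # earlier WildChat releases expose the generating model and the turn list;
    # these field names are assumptions and may differ in this release
    print(example["model"], len(example["conversation"]))
```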

Wenting Zhao (@wzhao_nlp) 's Twitter Profile Photo

wow, it's absolutely amazing to see that in a more contamination-free setting, qwen is even better than claude. big W for open-source models too