Ganqu Cui (@charlesfornlp)'s Twitter Profile
Ganqu Cui

@charlesfornlp

PhD candidate at THUNLP

ID: 1636142849615089664

Joined: 15-03-2023 23:10:38

24 Tweets

88 Followers

55 Following

Ganqu Cui (@charlesfornlp):

In our test, llama3-instruct got insane scores on IFEval (8b: 74.1, 70b: 83.2, GPT-4: 79.7), which indicates excellent instruction-following. 😯 However, on MATH, we asked models to put the answer in \boxed{} and got some weird output 🤔 What would be the reason?

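For readers reproducing the MATH setup: below is a minimal, illustrative helper for pulling the final \boxed{...} answer out of a model response (not the thread's actual evaluation code). Brace counting is used so that nested LaTeX such as \frac{1}{2} is kept intact.

```python
def extract_boxed(text: str) -> str | None:
    """Return the contents of the last \\boxed{...} in a model response, or None."""
    start = text.rfind(r"\boxed{")
    if start == -1:
        return None
    i = start + len(r"\boxed{")
    depth = 1          # we are just inside the opening brace
    out = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:   # matching close brace of \boxed{...}
                return "".join(out)
        out.append(ch)
        i += 1
    return None          # unbalanced braces: treat as no answer found

print(extract_boxed(r"The answer is \boxed{\frac{1}{2}}."))  # -> \frac{1}{2}
print(extract_boxed("The answer is 42."))                    # -> None
```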
Lifan Yuan (@lifan__yuan):

Wanna train PRMs, but process labels, whether annotated manually or automatically, sound too expensive to you😖? Introducing Implicit PRM🚀 – get free process rewards for your model by training an ORM on cheaper response-level data, with a simple parameterization and no additional cost💰!

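The tweet doesn't spell out the parameterization. The sketch below assumes the log-ratio form commonly used for implicit reward models: the outcome reward is beta * log(pi_theta(y|x) / pi_ref(y|x)), so the cumulative reward of any prefix is a sum of per-token log-ratios and each token's process reward is simply its increment. Names and the beta value are illustrative, not the paper's exact code.

```python
import torch

def implicit_process_rewards(policy_logprobs: torch.Tensor,
                             ref_logprobs: torch.Tensor,
                             beta: float = 0.05):
    """Per-token process rewards implied by an ORM trained with a log-ratio reward.

    policy_logprobs, ref_logprobs: [batch, seq_len] log-probabilities of the
    sampled tokens under the trained model and a frozen reference model.
    """
    per_token = beta * (policy_logprobs - ref_logprobs)  # process reward of each token
    outcome = per_token.sum(dim=-1)                       # response-level (outcome) reward
    return per_token, outcome
```

Under this assumption, training only needs response-level labels: the outcome reward is pushed toward the label with a cross-entropy loss, and the per-token process rewards fall out for free at inference time.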
Lifan Yuan (@lifan__yuan):

How to unlock advanced reasoning via scalable RL? 🚀Introducing PRIME (Process Reinforcement through Implicit Rewards) and Eurus-2, trained from a base model to surpass Qwen2.5-Math-Instruct using only 1/10 of the data. We're still scaling up - with 3x more training data to go! 🧵

Lifan Yuan (@lifan__yuan):

Kyle Corbitt Nathan Lambert That's exactly our motivation! An offline RM may suffer from reward hacking, so we update it online with the latest policy rollouts and ground-truth labels in a scalable way. This is the key to PRIME. Ground-truth reward alone can go a long way, but the online RM pushes the limits further.

<a href="/corbtt/">Kyle Corbitt</a> <a href="/natolambert/">Nathan Lambert</a> That’s exactly our motivation! An offline RM may suffer from reward hacking, so we online update it with latest policy rollouts and ground truth labels in a scalable way. This is the key of PRIME. gt reward alone can go a long way, but online rm pushes the limits further.
Ganqu Cui (@charlesfornlp):

The DeepSeek bros share the same view as us on PRMs: if it can't **scale** with online updates, it has no place in RL. Implicit PRM & PRIME were born exactly for this!

Lifan Yuan (@lifan__yuan):

1/ PRIME is live on arXiv💡! Building on our blog, we've added extensive experiments exploring:
- Implicit PRM design choices
- PRIME's integration with other RL algorithms
- Value models vs. PRMs
- RL from base models ("Zero")
See details below🧵 arxiv.org/abs/2502.01456