Ganqu Cui (@charlesfornlp)'s Twitter Profile
Ganqu Cui

@charlesfornlp

PhD candidate at THUNLP

ID: 1636142849615089664

Joined: 15-03-2023 23:10:38

24 Tweets

88 Followers

55 Following

Ganqu Cui (@charlesfornlp):

In our test, llama3-instruct got insane scores on IFEval (8b: 74.1, 70b: 83.2, GPT-4: 79.7), which indicates excellent instruction-following. 😯 However, on MATH, we asked models to put the answer in \boxed{} and got some weird output 🤔 What would be the reason?

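For readers reproducing the MATH setup: below is a minimal, illustrative helper for pulling the final \boxed{...} answer out of a model response (not the thread's actual evaluation code). Brace counting is used so that nested LaTeX such as \frac{1}{2} is kept intact.

```python
def extract_boxed(text: str) -> str | None:
    """Return the contents of the last \\boxed{...} in a model response, or None."""
    start = text.rfind(r"\boxed{")
    if start == -1:
        return None
    i = start + len(r"\boxed{")
    depth = 1          # we are just inside the opening brace
    out = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:   # matching close brace of \boxed{...}
                return "".join(out)
        out.append(ch)
        i += 1
    return None          # unbalanced braces: treat as no answer found

print(extract_boxed(r"The answer is \boxed{\frac{1}{2}}."))  # -> \frac{1}{2}
print(extract_boxed("The answer is 42."))                    # -> None
```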
Lifan Yuan (@lifan__yuan):

Wanna train PRMs, but process labels, whether annotated manually or automatically, sound too expensive to you😖? Introducing Implicit PRM🚀 – get free process rewards for your model by training an ORM on cheaper response-level data, with a simple parameterization and no additional cost💰!

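The tweet doesn't spell out the parameterization. The sketch below assumes the log-ratio form commonly used for implicit reward models: the outcome reward is beta * log(pi_theta(y|x) / pi_ref(y|x)), so the cumulative reward of any prefix is a sum of per-token log-ratios and each token's process reward is simply its increment. Names and the beta value are illustrative, not the paper's exact code.

```python
import torch

def implicit_process_rewards(policy_logprobs: torch.Tensor,
                             ref_logprobs: torch.Tensor,
                             beta: float = 0.05):
    """Per-token process rewards implied by an ORM trained with a log-ratio reward.

    policy_logprobs, ref_logprobs: [batch, seq_len] log-probabilities of the
    sampled tokens under the trained model and a frozen reference model.
    """
    per_token = beta * (policy_logprobs - ref_logprobs)  # process reward of each token
    outcome = per_token.sum(dim=-1)                       # response-level (outcome) reward
    return per_token, outcome
```

Under this assumption, training only needs response-level labels: the outcome reward is pushed toward the label with a cross-entropy loss, and the per-token process rewards fall out for free at inference time.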
Lifan Yuan (@lifan__yuan):

How to unlock advanced reasoning via scalable RL? 🚀Introducing PRIME (Process Reinforcement through Implicit Rewards) and Eurus-2, trained from a base model to surpass Qwen2.5-Math-Instruct using only 1/10 of the data. We're still scaling up - with 3x more training data to go! 🧵

Lifan Yuan (@lifan__yuan):

Kyle Corbitt Nathan Lambert That's exactly our motivation! An offline RM may suffer from reward hacking, so we update it online with the latest policy rollouts and ground-truth labels in a scalable way. This is the key to PRIME. Ground-truth reward alone can go a long way, but the online RM pushes the limits further.

<a href="/corbtt/">Kyle Corbitt</a> <a href="/natolambert/">Nathan Lambert</a> That’s exactly our motivation! An offline RM may suffer from reward hacking, so we online update it with latest policy rollouts and ground truth labels in a scalable way. This is the key of PRIME. gt reward alone can go a long way, but online rm pushes the limits further.
Ganqu Cui (@charlesfornlp):

The DeepSeek bros share the same view as us on PRMs: if it can't **scale** with online updates, it has no place in RL. Implicit PRM & PRIME were born exactly for this!

Lifan Yuan (@lifan__yuan):

1/ PRIME is live on arXiv💡! Building on our blog, we've added extensive experiments exploring:
- Implicit PRM design choices
- PRIME's integration with other RL algorithms
- Value models vs. PRMs
- RL from base models ("Zero")
See details below🧵 arxiv.org/abs/2502.01456