Sagnik Mukherjee (@saagnikkk)'s Twitter Profile
Sagnik Mukherjee

@saagnikkk

CS PhD student at @IllinoisCDS @convai_uiuc

ID: 1617896838950187009

Link: https://sagnikmukherjee.github.io/ | Joined: 24-01-2023 14:47:35

46 Tweets

107 Followers

159 Following

Lifan Yuan (@lifan__yuan)'s Twitter Profile Photo

The most exciting finding I learned recently! RL intrinsically leads to sparse updates, while SFT updates densely. We are still investigating if the updated gradients relate to “critical” params. Hope our findings help better understand RL and motivate thoughts on efficiency
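
A minimal sketch of how one could check this on two checkpoints of the same architecture (my own illustration, not the authors' code; the tolerance is an arbitrary choice and the checkpoint names are placeholders):

import torch  # state_dict values are assumed to be torch tensors

def update_sparsity(state_dict_before, state_dict_after, tol=1e-8):
    # Fraction of weights that moved more than `tol` between checkpoints.
    changed, total = 0, 0
    for name, before in state_dict_before.items():
        after = state_dict_after[name]
        diff = (after - before).abs()
        changed += (diff > tol).sum().item()
        total += diff.numel()
    return changed / total

# e.g. update_sparsity(base_model.state_dict(), rl_tuned_model.state_dict())
# Per the finding above, SFT should give a fraction near 1, RL a much smaller one.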

Sagnik Mukherjee (@saagnikkk)'s Twitter Profile Photo

LLMs have opened up a great variety of research topics, and this survey on persuasion by my awesome lab mate Beyza Bozdag is definitely a must-read 😉

Shivam Agarwal (@shivamag12)'s Twitter Profile Photo

Can entropy minimization alone improve LLM performance? And how far can it go without any labeled data? This work answers both: yes, and surprisingly far 🐮

At inference, EM can beat GPT-4o, Claude 3 Opus & Gemini 1.5 Pro on challenging scientific coding w/o any data/model update
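
For intuition only, one simple way to use entropy minimization as an unsupervised inference-time signal is to sample several candidates and keep the one the model is most confident about; this is my own illustration of the general idea, not necessarily the paper's method, and the data layout below is hypothetical:

import math

def mean_token_entropy(per_token_logprobs):
    # per_token_logprobs: list of dicts {token: log_prob} over the top-k
    # tokens at each generation step (hypothetical representation).
    step_entropies = [
        -sum(math.exp(lp) * lp for lp in dist.values())
        for dist in per_token_logprobs
    ]
    return sum(step_entropies) / len(step_entropies)

def pick_most_confident(candidates):
    # candidates: list of (completion_text, per_token_logprobs) pairs;
    # returns the completion whose predictions were sharpest (lowest entropy).
    return min(candidates, key=lambda c: mean_token_entropy(c[1]))[0]
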
Stella Li (@stellalisy)'s Twitter Profile Photo

🤯 We cracked RLVR with... Random Rewards?!
Training Qwen2.5-Math-7B with our Spurious Rewards improved MATH-500 by:
- Random rewards: +21%
- Incorrect rewards: +25%
- (FYI) Ground-truth rewards: +28.8%
How could this even work⁉️ Here's why: 🧵
Blogpost: tinyurl.com/spurious-rewar…
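
As a rough illustration of the three reward settings listed above (function names and signatures are mine, not from the authors' repo):

import random

def ground_truth_reward(answer: str, reference: str) -> float:
    # Standard RLVR signal: 1 if the rollout matches the verified answer.
    return 1.0 if answer.strip() == reference.strip() else 0.0

def incorrect_reward(answer: str, reference: str) -> float:
    # Deliberately rewards only wrong answers.
    return 1.0 - ground_truth_reward(answer, reference)

def random_reward(answer: str, reference: str) -> float:
    # Ignores the rollout entirely.
    return float(random.random() < 0.5)
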
Rohan Paul (@rohanpaul_ai)'s Twitter Profile Photo

Reinforcement learning improves LLMs, but people thought it needed full model updates.

The paper shows reinforcement learning updates only a small part of the model.

Methods 🔧:

→ Researchers measured parameter changes before and after RL fine-tuning using various public
Ganqu Cui (@charlesfornlp)'s Twitter Profile Photo

So many works talk about entropy, but what is the **mechanism** of entropy in RL for LLMs? 🤔

Our work gives a principled understanding, as well as two tricks that get entropy **controlled** 🧵
Lifan Yuan (@lifan__yuan)'s Twitter Profile Photo

We always want to scale up RL, yet simply training longer doesn't necessarily push the limits - exploration gets impeded by entropy collapse. 
We show that the performance ceiling is surprisingly predictable, and the collapse is driven by covariance between logp and advantage.
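
Reading that claim as a formula (my paraphrase of the tweet, with constants and conditioning glossed over), the per-step change in policy entropy scales with the covariance between the log-probability and the advantage of sampled actions:

H_{k+1} - H_k \approx -\eta \, \mathrm{Cov}_{a \sim \pi_{\theta_k}(\cdot \mid s)}\big( \log \pi_{\theta_k}(a \mid s),\ A(s, a) \big)

When high-advantage actions are already high-probability, the covariance is positive and entropy keeps falling, which is the collapse described above.
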
Xinyu Zhu (@tianhongzxy)'s Twitter Profile Photo

🔥The debate’s been wild: How does the reward in RLVR actually improve LLM reasoning?🤔
🚀Introducing our new paper👇
💡TL;DR: Just penalizing incorrect rollouts❌ — no positive reward needed — can boost LLM reasoning, and sometimes better than PPO/GRPO!

🧵[1/n]
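
A tiny sketch of that reward scheme (hypothetical signature, my own illustration): correct rollouts get zero reward and only verified-incorrect ones are penalized, so the policy is only pushed away from wrong answers.

def negative_only_reward(answer: str, reference: str) -> float:
    # Penalize verified-incorrect rollouts; stay neutral on correct ones.
    is_correct = answer.strip() == reference.strip()
    return 0.0 if is_correct else -1.0
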
Chhavi Yadav (@chhaviyadav_)'s Twitter Profile Photo

Upon graduation, I paused to reflect on what my PhD had truly taught me. Was it just how to write papers, respond to brutal reviewer comments, and survive without much sleep? Or did it leave a deeper imprint on me — beyond the metrics and milestones? Turns out, it did.

A