Sagnik Mukherjee (@saagnikkk)'s Twitter Profile
Sagnik Mukherjee

@saagnikkk

CS PhD student at @IllinoisCDS @convai_uiuc

ID: 1617896838950187009

Link: https://sagnikmukherjee.github.io/ | Joined: 24-01-2023 14:47:35

46 Tweets

107 Followers

159 Following

Lifan Yuan (@lifan__yuan)'s Twitter Profile Photo

The most exciting finding I learned recently! RL intrinsically leads to sparse updates, while SFT updates densely. We are still investigating if the updated gradients relate to “critical” params. Hope our findings help better understand RL and motivate thoughts on efficiency
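
A minimal sketch of how one could check this on two checkpoints of the same architecture (my own illustration, not the authors' code; the tolerance is an arbitrary choice and the checkpoint names are placeholders):

import torch  # state_dict values are assumed to be torch tensors

def update_sparsity(state_dict_before, state_dict_after, tol=1e-8):
    # Fraction of weights that moved more than `tol` between checkpoints.
    changed, total = 0, 0
    for name, before in state_dict_before.items():
        after = state_dict_after[name]
        diff = (after - before).abs()
        changed += (diff > tol).sum().item()
        total += diff.numel()
    return changed / total

# e.g. update_sparsity(base_model.state_dict(), rl_tuned_model.state_dict())
# Per the finding above, SFT should give a fraction near 1, RL a much smaller one.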

Sagnik Mukherjee (@saagnikkk)'s Twitter Profile Photo

LLMs have opened up a great variety of research topics, and this survey on persuasion by my awesome lab mate Beyza Bozdag is definitely a must-read 😉

Shivam Agarwal (@shivamag12)'s Twitter Profile Photo

Can entropy minimization alone improve LLM performance? And how far can it go without any labeled data? This work answers both: yes, and surprisingly far 🐮

At inference, EM can beat GPT-4o, Claude 3 Opus & Gemini 1.5 Pro on challenging scientific coding w/o any data/model update
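
For intuition only, one simple way to use entropy minimization as an unsupervised inference-time signal is to sample several candidates and keep the one the model is most confident about; this is my own illustration of the general idea, not necessarily the paper's method, and the data layout below is hypothetical:

import math

def mean_token_entropy(per_token_logprobs):
    # per_token_logprobs: list of dicts {token: log_prob} over the top-k
    # tokens at each generation step (hypothetical representation).
    step_entropies = [
        -sum(math.exp(lp) * lp for lp in dist.values())
        for dist in per_token_logprobs
    ]
    return sum(step_entropies) / len(step_entropies)

def pick_most_confident(candidates):
    # candidates: list of (completion_text, per_token_logprobs) pairs;
    # returns the completion whose predictions were sharpest (lowest entropy).
    return min(candidates, key=lambda c: mean_token_entropy(c[1]))[0]
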
Stella Li (@stellalisy)'s Twitter Profile Photo

🤯 We cracked RLVR with... Random Rewards?!
Training Qwen2.5-Math-7B with our Spurious Rewards improved MATH-500 by:
- Random rewards: +21%
- Incorrect rewards: +25%
- (FYI) Ground-truth rewards: +28.8%
How could this even work⁉️ Here's why: 🧵
Blogpost: tinyurl.com/spurious-rewar…
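
As a rough illustration of the three reward settings listed above (function names and signatures are mine, not from the authors' repo):

import random

def ground_truth_reward(answer: str, reference: str) -> float:
    # Standard RLVR signal: 1 if the rollout matches the verified answer.
    return 1.0 if answer.strip() == reference.strip() else 0.0

def incorrect_reward(answer: str, reference: str) -> float:
    # Deliberately rewards only wrong answers.
    return 1.0 - ground_truth_reward(answer, reference)

def random_reward(answer: str, reference: str) -> float:
    # Ignores the rollout entirely.
    return float(random.random() < 0.5)
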
Rohan Paul (@rohanpaul_ai)'s Twitter Profile Photo

Reinforcement learning improves LLMs, but people thought it needed full model updates.

The paper shows reinforcement learning updates only a small part of the model.

Methods 🔧:

→ Researchers measured parameter changes before and after RL fine-tuning using various public
Ganqu Cui (@charlesfornlp)'s Twitter Profile Photo

So many works talk about entropy, but what is the **mechanism** of entropy in RL for LLMs? 🤔

Our work gives a principled understanding, as well as two tricks that get entropy **controlled** 🧵
Lifan Yuan (@lifan__yuan)'s Twitter Profile Photo

We always want to scale up RL, yet simply training longer doesn't necessarily push the limits - exploration gets impeded by entropy collapse. 
We show that the performance ceiling is surprisingly predictable, and the collapse is driven by covariance between logp and advantage.
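
Reading that claim as a formula (my paraphrase of the tweet, with constants and conditioning glossed over), the per-step change in policy entropy scales with the covariance between the log-probability and the advantage of sampled actions:

H_{k+1} - H_k \approx -\eta \, \mathrm{Cov}_{a \sim \pi_{\theta_k}(\cdot \mid s)}\big( \log \pi_{\theta_k}(a \mid s),\ A(s, a) \big)

When high-advantage actions are already high-probability, the covariance is positive and entropy keeps falling, which is the collapse described above.
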
Xinyu Zhu (@tianhongzxy)'s Twitter Profile Photo

🔥The debate’s been wild: How does the reward in RLVR actually improve LLM reasoning?🤔
🚀Introducing our new paper👇
💡TL;DR: Just penalizing incorrect rollouts❌ — no positive reward needed — can boost LLM reasoning, and sometimes better than PPO/GRPO!

🧵[1/n]
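
A tiny sketch of that reward scheme (hypothetical signature, my own illustration): correct rollouts get zero reward and only verified-incorrect ones are penalized, so the policy is only pushed away from wrong answers.

def negative_only_reward(answer: str, reference: str) -> float:
    # Penalize verified-incorrect rollouts; stay neutral on correct ones.
    is_correct = answer.strip() == reference.strip()
    return 0.0 if is_correct else -1.0
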
Chhavi Yadav (@chhaviyadav_)'s Twitter Profile Photo

Upon graduation, I paused to reflect on what my PhD had truly taught me. Was it just how to write papers, respond to brutal reviewer comments, and survive without much sleep? Or did it leave a deeper imprint on me — beyond the metrics and milestones? Turns out, it did.

A