Xinyu Zhu (@tianhongzxy)'s Twitter Profile
Xinyu Zhu

@tianhongzxy

CS Ph.D. student @UVA. Summer intern @Apple. I work on improving LLM reasoning. Previously: master's @Tsinghua_uni, intern @MSFTResearch Asia. #NLProc

ID: 932230530766061569

Link: https://zhuxinyu.top | Joined: 19-11-2017 12:54:13

105 Tweets

140 Followers

445 Following

Andrew Zhao (@andrewz45732491)'s Twitter Profile Photo

hmmm, if you never push probabilities up, you maintain more entropy by not doing excessive sharpening. These guys might be onto something 🧐

1a3orn (@1a3orn)'s Twitter Profile Photo

Oh man, this is a gorgeous idea. Training *against* negative samples but not towards positive ones maintains entropy in the model, and therefore increases pass@k at high k during RL.
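As a background sketch of the idea (an illustration, not the exact objective from any specific paper): a REINFORCE-style loss that only penalizes negatively rewarded samples could look like the following. The function name and signature are hypothetical.

```python
import torch

def negative_only_policy_loss(logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """REINFORCE-style surrogate that only pushes *down* on negatively
    rewarded samples and applies no gradient to positively rewarded ones.

    Pushing down on failures spreads probability mass over all other
    continuations instead of concentrating it on the few sampled
    successes, which is why entropy stays higher than in standard updates.

    logprobs: (batch,) summed log-probs of sampled sequences (requires grad)
    rewards:  (batch,) scalar rewards, e.g. +1 correct / -1 incorrect
    """
    neg_mask = (rewards < 0).float()              # keep only negative samples
    # For a reward of -1, minimizing this lowers the sample's log-prob.
    loss = -(rewards * logprobs * neg_mask).mean()
    return loss
```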

Xinyu Zhu (@tianhongzxy)'s Twitter Profile Photo

We find that a large base LM can be boosted to match its RL-tuned version 📈 without training, simply by transferring the logit difference between a small RL-tuned model and its base at inference time! 🤯 Check out the 🧵👇
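Mechanically, the described transfer resembles proxy-tuning-style logit arithmetic. A minimal sketch under the assumption that all three models share a vocabulary and expose a Hugging Face-style `model(input_ids).logits` interface; the function name is hypothetical.

```python
import torch

@torch.no_grad()
def transferred_logits(large_base, small_tuned, small_base, input_ids):
    """Steer a large *base* model with the RL 'delta' of a small model pair:

        logits = large_base(x) + (small_tuned(x) - small_base(x))

    Sampling from the combined logits approximates a large RL-tuned model
    without ever training it. Requires a shared tokenizer/vocabulary.
    """
    delta = small_tuned(input_ids).logits - small_base(input_ids).logits
    return large_base(input_ids).logits + delta

# Hypothetical usage: next-token distribution from the combined logits.
# probs = torch.softmax(transferred_logits(lb, st, sb, ids)[:, -1, :], dim=-1)
```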

Sean Welleck (@wellecks)'s Twitter Profile Photo

New paper by Andre He: "Rewarding the Unlikely: Lifting GRPO Beyond Distribution Sharpening" (arxiv.org/abs/2506.02355). Tired of sharpening the distribution? Try the unlikeliness reward to learn new things from the roads less traveled.
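One plausible reading of "unlikeliness reward" (hedged, since the tweet doesn't give the formula and the paper's exact shaping may differ): among correct rollouts, give extra credit to sequences the current policy finds improbable, so GRPO reinforces rare correct solutions instead of only sharpening the modes it already prefers.

```python
def shaped_reward(base_reward: float, seq_logprob: float, alpha: float = 0.1) -> float:
    """Hypothetical 'unlikeliness'-style shaping.

    base_reward: task reward (e.g. 1.0 correct / 0.0 incorrect)
    seq_logprob: policy log-probability of the sampled sequence (<= 0)
    """
    if base_reward <= 0:
        return base_reward            # no bonus for wrong answers
    unlikeliness = -seq_logprob       # larger for rarer sequences
    return base_reward + alpha * unlikeliness
```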

Stanford NLP Group (@stanfordnlp)'s Twitter Profile Photo

Only ding a model for making mistakes! It gives better results in RL and avoids mode collapse. We still understand so little about RL! But we’re learning. Your science dollars at work.

Xinyu Zhu (@tianhongzxy)'s Twitter Profile Photo

🚀 Check out our new #ICML2025 paper led by Zhepei Wei! 1.73× faster LLM decoding, with no draft model needed and no discrepancy from vanilla decoding!
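The tweet doesn't spell out the mechanism, but "no draft model" plus "no discrepancy from vanilla decoding" point at draft-free speculative decoding, whose losslessness comes from a verification step like the sketch below (greedy case; a background illustration, not the paper's code; the draft tokens could come from, e.g., the model's own shallow layers).

```python
import torch

@torch.no_grad()
def verify_draft_greedy(model, input_ids, draft_ids):
    """Lossless verification step used in speculative-style decoding:
    run the full model once over the drafted tokens and keep the longest
    prefix matching what vanilla greedy decoding would have emitted.
    Accepted tokens are identical to vanilla decoding by construction."""
    seq = torch.cat([input_ids, draft_ids], dim=-1)
    logits = model(seq).logits
    n_draft = draft_ids.shape[-1]
    # Logits at position t predict token t+1, so these are the model's
    # own greedy choices at each drafted position.
    preds = logits[:, -n_draft - 1:-1, :].argmax(dim=-1)
    match = (preds == draft_ids).long().cumprod(dim=-1)  # prefix-match mask
    n_accept = int(match.sum())
    # One free "bonus" token: the greedy prediction after the accepted prefix.
    bonus = logits[:, input_ids.shape[-1] + n_accept - 1, :].argmax(dim=-1)
    return draft_ids[:, :n_accept], bonus
```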

Xinyu Zhu (@tianhongzxy)'s Twitter Profile Photo

🚀 Interesting work by Taiqiang Wu! 💡 Quick takeaway: if you collect additional instruction-following data for SFT, it's better to fine-tune the Base model and then graft the weights onto its corresponding Instruct model, rather than continuing to train the Instruct model directly!
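One common way to realize this kind of grafting is task-vector weight arithmetic; here is a sketch under that assumption, since the tweet doesn't give the exact recipe (`graft_sft_delta` is hypothetical).

```python
import torch

def graft_sft_delta(instruct_sd, base_sd, base_sft_sd):
    """Hypothetical grafting via task-vector arithmetic:

        W_grafted = W_instruct + (W_base_sft - W_base)

    i.e. apply the delta learned by SFT on the Base model to the Instruct
    model, instead of fine-tuning Instruct directly. Assumes all three
    checkpoints share the same architecture (state_dict keys and shapes)."""
    return {
        name: instruct_sd[name] + (base_sft_sd[name] - base_sd[name])
        for name in instruct_sd
    }

# Hypothetical usage with Hugging Face state dicts:
# instruct_model.load_state_dict(
#     graft_sft_delta(instruct_model.state_dict(),
#                     base_model.state_dict(),
#                     base_sft_model.state_dict()))
```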