Jing Xu (@jingxu_ml)'s Twitter Profile
Jing Xu

@jingxu_ml

LLM alignment, reasoning @FAIR; PhD from @Penn

jxmsml.github.io

ID: 1255518200298639361

Joined: 29-04-2020 15:24:22

60 Tweets

223 Followers

300 Following

Jason Weston (@jaseweston)

🚨New paper!🚨
Meta-Rewarding LMs
- LM is actor, judge & meta-judge
- Learns to reward actions better by judging its own judgments (assigning *meta-rewards*)
- Improves acting & judging over time without human labels
... beats Self-Rewarding LMs
arxiv.org/abs/2407.19594
🧵(1/6)
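To make the loop above concrete, here is a minimal Python sketch of one meta-rewarding iteration in the spirit of the tweet (actor → judge → meta-judge). This is not the paper's code: sample_responses, judge_response, and meta_judge are hypothetical stand-ins for LLM calls, and the pairing scheme is a simplified illustration.

```python
from typing import Callable, List, Tuple

def meta_rewarding_iteration(
    prompts: List[str],
    sample_responses: Callable[[str, int], List[str]],        # actor: (prompt, k) -> k responses
    judge_response: Callable[[str, str], Tuple[float, str]],  # judge: (prompt, response) -> (score, rationale)
    meta_judge: Callable[[str, str, str], int],               # meta-judge: (response, judgment_a, judgment_b) -> 0 or 1
    k: int = 4,
) -> Tuple[list, list]:
    """One self-improvement round: build preference pairs for the actor
    (from judge scores) and for the judge (from meta-judge comparisons)."""
    actor_pairs, judge_pairs = [], []
    for prompt in prompts:
        responses = sample_responses(prompt, k)
        # Judge every response; each judgment is a (score, rationale) pair.
        scored = sorted(
            ((judge_response(prompt, r), r) for r in responses),
            key=lambda item: item[0][0],
        )
        (_, worst), (_, best) = scored[0], scored[-1]
        actor_pairs.append((prompt, best, worst))            # DPO-style pair for the actor
        # Compare two independent judgments of the same response; the winner
        # becomes the "chosen" judgment, i.e. a meta-reward for the judge.
        (_, judgment_a), response = scored[-1]
        _, judgment_b = judge_response(prompt, response)
        if meta_judge(response, judgment_a, judgment_b) == 0:
            judge_pairs.append((response, judgment_a, judgment_b))
        else:
            judge_pairs.append((response, judgment_b, judgment_a))
    return actor_pairs, judge_pairs
```

The point the sketch tries to capture is that the same model produces responses, judges them, and then judges its own judgments, so both the actor pairs and the judge pairs can feed preference optimization without human labels.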
Jason Weston (@jaseweston)

🚨 Self-Consistency Preference Optimization (ScPO)🚨
- New self-training method without human labels - learn to make the model more consistent!
- Works well for reasoning tasks where RMs fail to evaluate correctness.
- Close to performance of supervised methods *without* labels,
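A rough sketch of how self-consistency votes might be turned into preference data, assuming the recipe is roughly "sample k solutions, prefer the most consistent final answer over the least consistent one, and weight by the vote margin". The helper callables sample_solutions and extract_answer, and the exact pairing and weighting, are illustrative assumptions rather than the ScPO implementation.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

def self_consistency_pair(
    problem: str,
    sample_solutions: Callable[[str, int], List[str]],  # model: (problem, k) -> k sampled CoT solutions
    extract_answer: Callable[[str], str],                # pulls the final answer out of a solution
    k: int = 16,
) -> Optional[Tuple[str, str, float]]:
    """Build one (chosen, rejected, weight) triple by preferring the most
    self-consistent final answer over the least consistent one."""
    solutions = sample_solutions(problem, k)
    votes = Counter(extract_answer(s) for s in solutions)
    if len(votes) < 2:
        return None                        # all samples agree: no informative pair
    ranked = votes.most_common()
    top_answer, top_count = ranked[0]
    low_answer, low_count = ranked[-1]
    chosen = next(s for s in solutions if extract_answer(s) == top_answer)
    rejected = next(s for s in solutions if extract_answer(s) == low_answer)
    weight = (top_count - low_count) / k   # vote margin can weight the preference loss
    return chosen, rejected, weight
```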
Jiao Sun (@sunjiao123sun_)

Mitigating racial bias from LLMs is a lot easier than removing it from humans! 

Can’t believe this happened at the best AI conference @NeurIPSConf

We have ethical reviews for authors, but missed it for invited speakers? 😡
Jason Weston (@jaseweston)

💀 Introducing RIP: Rejecting Instruction Preferences💀

A method to *curate* high quality data, or *create* high quality synthetic data.

Large performance gains across benchmarks (AlpacaEval2, Arena-Hard, WildBench).

Paper 📄: arxiv.org/abs/2501.18578
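For intuition, a hedged sketch of what a RIP-style prompt filter could look like: sample several responses per prompt, score them with a reward model, and keep only prompts whose response-level reward statistics pass quality checks. The particular statistics and thresholds below (minimum rejected-response reward, maximum chosen-rejected gap) are assumptions for illustration; see the paper for the actual criteria.

```python
from typing import Callable, List

def rip_style_filter(
    prompts: List[str],
    sample_responses: Callable[[str, int], List[str]],  # policy: (prompt, k) -> k responses
    reward: Callable[[str, str], float],                 # reward model: (prompt, response) -> score
    k: int = 8,
    min_rejected_reward: float = 0.0,   # hypothetical threshold
    max_reward_gap: float = 5.0,        # hypothetical threshold
) -> List[str]:
    """Return the subset of prompts judged high quality by reward statistics
    of their sampled responses; the rest are rejected from training."""
    kept = []
    for prompt in prompts:
        rewards = sorted(reward(prompt, r) for r in sample_responses(prompt, k))
        rejected_r, chosen_r = rewards[0], rewards[-1]
        if rejected_r >= min_rejected_reward and (chosen_r - rejected_r) <= max_reward_gap:
            kept.append(prompt)
    return kept
```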
Jing Xu (@jingxu_ml)

New data selection & synthetic data creation method can dramatically improve model performance by filtering out 77% of training examples!

Jason Weston (@jaseweston)

Olga Golovneva Tianhao Wu Weizhe Yuan Jing Xu @ICML2025 Sainbayar Sukhbaatar Ping (Iris) Yu 4/ 💀 We show RIP works across various data (WildChat, HelpSteer, Self-RIP), LLMs (Llama 3.1 8B or 3.3 70B) & reward models. We were surprised by how well RIP works given how simple it is. Read the paper for more & hope you don't "reject" this research!🪦 📄arxiv.org/abs/2501.18578

Zichen Liu @ ICLR2025 (@zzlccc)

🚨There May Not be Aha Moment in R1-Zero-like Training: oatllm.notion.site/oat-zero A common belief about the recent R1-Zero-like training is that self-reflections *emerge* as a result of RL training. We carefully investigated and showed the opposite. 🧵

Anthropic (@anthropicai)

A few researchers at Anthropic have, over the past year, had a part-time obsession with a peculiar problem. Can Claude play Pokémon? A thread:

Jason Weston (@jaseweston)

🚨 New Paper 🚨
An Overview of Large Language Models for Statisticians
📝: arxiv.org/abs/2502.17814

- Dual perspectives on Statistics ➕ LLMs: Stat for LLM & LLM for Stat
- Stat for LLM: How statistical methods can improve LLM uncertainty quantification, interpretability,
Jason Weston (@jaseweston)

Google friends & ex-colleagues -- Google scholar seems pretty broken😔. Our most cited paper from last year "Self-Rewarding LLMs" has disappeared! Scholar has clustered it with another paper (SPIN) and it isn't in the search results. This is bad for PhD student & first author

Archiki Prasad (@archikiprasad)

🎉 Excited to share that my internship work, ScPO, on self-training LLMs to improve reasoning without human labels, has been accepted to #ICML2025! Many thanks to my awesome collaborators at AI at Meta and @uncnlp🌞Looking forward to presenting ScPO in Vancouver 🇨🇦

Jason Weston (@jaseweston)

🚨Announcing RAM 2 workshop @ COLM25 - call for papers🚨 
- 10 years on, we present the sequel to the classic RAM🐏 (Reasoning, Attention, Memory) workshop that took place in 2015 at the cusp of major change in the area. Now in 2025 we reflect on what's happened and discuss the
Jason Weston (@jaseweston)

🌉 Bridging Offline & Online RL for LLMs 🌉
📝: arxiv.org/abs/2506.21495
New paper shows on verifiable & non-verifiable tasks:
- Online DPO & GRPO give similar performance.
- Semi-online (iterative) DPO with sync every s steps (more efficient!) works very well also.
- Offline DPO
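A minimal sketch of the semi-online schedule mentioned above, assuming it amounts to re-syncing the rollout (data-generating) model with the current policy every s optimizer steps, which interpolates between fully offline (never sync) and fully online (sync every step). All function names are placeholders, not the paper's code.

```python
from typing import Callable, List

def semi_online_dpo_loop(
    num_steps: int,
    s: int,                                           # sync period: 1 ≈ fully online, very large ≈ offline
    sync_rollout_to_policy: Callable[[], None],       # copy current policy weights into the rollout model
    generate_preference_batch: Callable[[], object],  # sample preference pairs with the (stale) rollout model
    dpo_step: Callable[[object], float],              # one DPO optimizer step; returns the loss
) -> List[float]:
    losses = []
    for step in range(num_steps):
        if step % s == 0:
            sync_rollout_to_policy()                  # periodic refresh instead of syncing every step
        batch = generate_preference_batch()
        losses.append(dpo_step(batch))
    return losses
```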
Jing Xu (@jingxu_ml)

Heading to ICML to present our work Rejecting Instruction Preferences (RIP) for better data curation and synthesis on Wed 07/16 (4:30pm - 7:00pm)! Excited to connect with folks interested in synthetic data, reasoning, RL and anything in general @FAIR. #ICML2025

Jason Weston (@jaseweston)

🤖Introducing: CoT-Self-Instruct 🤖
📝: arxiv.org/abs/2507.23751
- Builds high-quality synthetic data via reasoning CoT + quality filtering
- Gains on reasoning tasks: MATH500, AMC23, AIME24 & GPQA-💎
- Outperforms existing train data s1k & OpenMathReasoning
- Gains on
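A speculative sketch of a CoT-Self-Instruct-style data pipeline, assuming the recipe is roughly "reason over a few seed tasks with chain-of-thought, propose a new task, and keep it only if it passes a quality filter". The prompt wording and the generate and quality_filter callables are illustrative assumptions rather than the paper's implementation.

```python
import random
from typing import Callable, List

def cot_self_instruct_sketch(
    seed_prompts: List[str],
    generate: Callable[[str], str],         # LLM call: prompt text -> completion
    quality_filter: Callable[[str], bool],  # e.g., an answer-consistency or reward-based check
    n_new: int = 100,
    n_seeds_per_call: int = 2,              # must not exceed len(seed_prompts)
) -> List[str]:
    """Generate synthetic tasks from seed tasks via CoT prompting plus filtering."""
    synthetic = []
    while len(synthetic) < n_new:
        seeds = random.sample(seed_prompts, n_seeds_per_call)
        prompt = (
            "Here are example tasks:\n"
            + "\n".join(f"- {s}" for s in seeds)
            + "\nThink step by step about what makes these tasks useful, "
              "then write one new task of similar difficulty.\nNew task:"
        )
        candidate = generate(prompt)
        if quality_filter(candidate):       # discard low-quality generations
            synthetic.append(candidate)
    return synthetic
```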