Rafael Rafailov @ NeurIPS (@rm_rafailov)'s Twitter Profile
Rafael Rafailov @ NeurIPS

@rm_rafailov

Ph.D. Student at @StanfordAILab. I work on Foundation Models and Decision Making. Previously @GoogleDeepMind @UCBerkeley

ID: 1660344669916786688

Link: https://rmrafailov.github.io/ · Joined: 21-05-2023 18:11:57

1.1K Tweets

6.6K Followers

776 Following

Rafael Rafailov @ NeurIPS (@rm_rafailov)'s Twitter Profile Photo

“We developed a fully asynchronous online RL training framework that enhanced flexibility. …. This innovation resulted in a ~10x improvement in training efficiency over previous generations.” Async distributed RL strikes again!
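
The framework being quoted isn't described beyond that sentence, but the core idea of fully asynchronous online RL is to decouple rollout collection from gradient updates so neither side waits on the other. A minimal Python sketch under that reading (the dummy policy, environment, and update rule are all stand-ins, not anyone's actual trainer):

# Minimal sketch of an asynchronous actor/learner loop: actors push rollouts
# into a bounded queue using possibly-stale weights; the learner updates
# whenever data is available. The "policy" and "gradient" are placeholders.
import queue
import random
import threading
import time

traj_queue = queue.Queue(maxsize=64)   # rollouts flow actors -> learner
weights_lock = threading.Lock()
weights = {"w": 0.0}                   # stand-in for real model parameters
stop = threading.Event()

def collect_episode(w):
    time.sleep(0.01)                   # simulated environment latency
    return {"reward": random.random(), "w_used": w}

def actor():
    while not stop.is_set():
        with weights_lock:
            w = weights["w"]           # snapshot; may lag behind the learner
        traj_queue.put(collect_episode(w))  # blocks only if the buffer is full

def learner(num_updates=200, lr=0.1):
    for _ in range(num_updates):
        traj = traj_queue.get()        # consume whatever rollout is ready
        with weights_lock:
            weights["w"] += lr * traj["reward"]  # stand-in "gradient" update
    stop.set()

for _ in range(4):
    threading.Thread(target=actor, daemon=True).start()
learner()
print("final weights:", weights)

Because actors never wait for an optimizer step (and the learner never waits for a specific environment), accelerator utilization stays high even when environment latency varies, which is where the efficiency gains of asynchronous setups usually come from.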

Suraj Nair (@surajnair_1)'s Twitter Profile Photo

Since the first year of my PhD, every talk I’ve given has opened with a slide about the distant north star: dropping a robot in a home it’s never been in before and having it do useful things. I think it might be time for me to find a new opening slide 😀. Thrilled to share π-0.5!

Aviral Kumar (@aviral_kumar2)'s Twitter Profile Photo

At #ICLR25 workshops, my students + collaborators will give many oral talks on newer stuff (don't miss!):
- robot VLA RL fine-tuning (Max Sobol Mark)
- optimizing test-time compute (Yuxiao Qu)
- why RL is crucial for test-time scaling (Amrith Setlur)
- scaling laws for value-based RL

John Yang (@jyangballin)'s Twitter Profile Photo

40% with just 1 try per task: SWE-agent-LM-32B is the new #1 open source model on SWE-bench Verified.

We built it by synthesizing a ton of agentic training data from 100+ Python repos.

Today we’re open-sourcing the toolkit that made it happen: SWE-smith.
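
SWE-smith's actual pipeline lives in the open-sourced toolkit; purely as an illustration of the general recipe (perturb a repo, confirm its tests now fail, record the broken state plus a verification command as a task), here is a hypothetical sketch in which every helper and field name is my own:

# Illustrative only, not SWE-smith's real implementation: turn a clean repo
# checkout into an agentic training task by applying a bug-injection patch,
# confirming the test suite now fails, and saving the broken state + verify command.
import json
import subprocess
from pathlib import Path

def tests_pass(repo: Path) -> bool:
    proc = subprocess.run(["pytest", "-q", "-x"], cwd=repo, capture_output=True)
    return proc.returncode == 0

def make_task_instance(repo: Path, bug_patch: str, out_file: Path):
    # Apply the (hypothetical) perturbation patch from stdin.
    subprocess.run(["git", "apply", "-"], cwd=repo, input=bug_patch.encode(), check=True)
    try:
        if tests_pass(repo):
            return None                    # perturbation broke nothing -> not a useful task
        task = {
            "repo": str(repo),
            "bug_patch": bug_patch,        # the agent must effectively undo this
            "verify_cmd": "pytest -q -x",  # pass/fail gives a checkable outcome
        }
        out_file.write_text(json.dumps(task, indent=2))
        return task
    finally:
        # Always restore the repo to its clean state.
        subprocess.run(["git", "checkout", "--", "."], cwd=repo, check=True)

The checkable pass/fail signal is what makes such synthesized instances usable as training data for an agent: trajectories can be filtered or rewarded by simply rerunning the recorded verification command.
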
Jason Weston (@jaseweston)'s Twitter Profile Photo

🚨 New paper 🚨
J1: Incentivizing Thinking in LLM-as-a-Judge via RL

- Converts judgement task into a verifiable one for both verifiable and non-verifiable prompts. Uses only synthetic pairwise data

- Optimizes thoughts, scores, and judgments using GRPO

- Outperforms all
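
The paper's code isn't quoted here, but the "verifiable" part is easy to picture: each synthetic pair comes with a known better response, so a sampled judgment can be scored exactly, and GRPO then only needs group-normalized rewards. A rough sketch under those assumptions (the verdict-parsing convention and function names are mine, not the paper's API):

# Rough sketch of a verifiable pairwise-judge reward plus GRPO-style
# group-normalized advantages. Parsing convention and names are assumptions.
from statistics import mean, pstdev

def verdict_reward(judgment_text: str, preferred: str) -> float:
    # Assume the judge ends its output with a line like "Verdict: A".
    verdict = judgment_text.strip().splitlines()[-1].lower()
    return 1.0 if verdict.endswith(preferred.lower()) else 0.0

def grpo_advantages(rewards: list[float]) -> list[float]:
    # Group-relative advantages: normalize rewards within one prompt's group.
    mu, sigma = mean(rewards), pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]

# Four sampled judgments for one synthetic pair whose known-better answer is "A".
samples = ["...thoughts...\nVerdict: A", "...thoughts...\nVerdict: B",
           "...thoughts...\nVerdict: A", "...thoughts...\nVerdict: A"]
rewards = [verdict_reward(s, preferred="A") for s in samples]
print(grpo_advantages(rewards))   # judgments that picked A get positive advantage
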
Rafael Rafailov @ NeurIPS (@rm_rafailov)'s Twitter Profile Photo

When we first published our work on this 9 months ago, it was rejected for being impractical in realistic cases. Six months later it was rejected for lack of novelty. It’s the way academic publishing goes.

James Alcorn (@jamesalcorn94)'s Twitter Profile Photo

congrats Rafael Rafailov @ NeurIPS on your hard-earned acceptance to the USofA as alien of officially extraordinary ability. The alien piece comes as no surprise to your mates of course, but at least the general public now has fair warning and a fighting chance. To celebrate with a fitting

SynthLabs (@synth_labs)'s Twitter Profile Photo

Our new method (ALP) monitors solve rates across RL rollouts and applies inverse difficulty penalties during RL training.

Result? Models learn an implicit difficulty estimator—allocating 5x more tokens to hard vs easy problems, cutting overall usage by 50%

🧵👇1/10
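
The thread has the details; one plausible reading of "inverse difficulty penalties" (my sketch, not necessarily the paper's exact objective) is to estimate each prompt's difficulty from its own group of rollouts and scale a per-token length penalty by the solve rate, so easy prompts get pushed toward short answers while hard prompts keep their token budget:

# Hedged sketch of an ALP-style shaped reward (my reading of the tweet, not
# the paper's formula): length is penalized in proportion to the prompt's
# empirical solve rate, i.e. the inverse of its difficulty.
def shaped_rewards(correct: list[bool], lengths: list[int], beta: float = 1e-3) -> list[float]:
    # correct/lengths are per-rollout outcomes for ONE prompt's group.
    solve_rate = sum(correct) / len(correct)   # in-batch difficulty estimate
    return [
        float(c) - beta * solve_rate * n       # inverse-difficulty length penalty
        for c, n in zip(correct, lengths)
    ]

# Easy prompt (3/4 solved): long answers lose a noticeable chunk of reward.
print(shaped_rewards([True, True, True, False], [900, 300, 1200, 800]))
# Hard prompt (1/4 solved): the same lengths are barely penalized.
print(shaped_rewards([True, False, False, False], [900, 300, 1200, 800]))

Because the penalty weight is computed from the same rollouts used for the policy update, no separate difficulty labels are needed, which matches the claim that the model ends up with an implicit difficulty estimator.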