Michael Luo (@michaelzluo)'s Twitter Profile
Michael Luo

@michaelzluo

CS PhD at UC Berkeley @berkeley_ai, Project Lead of @agentica_

ID: 1671255777565347840

Link: https://michaelzhiluo.github.io · Joined: 20-06-2023 20:36:57

152 Tweets

358 Followers

195 Following

Karan Dalal (@karansdalal)

I’m excited to share a project I’ve been working on for over a year, which I believe will fundamentally change our approach to language models.

We’ve designed a new architecture, which replaces the hidden state of an RNN with a machine learning model. This model compresses
AmbiRobotics (@ambirobotics)

This week Encord hosted AI After Hours at GitHub HQ and our Foundation Model Lead, Vishal Satish, shared how Ambi Robotics is leveraging 200K+ hours of high-fidelity production data to train PRIME-1—a domain-expert foundation model designed for industrial reliability.
Yifei Zhou (@yifeizhou02)

📢LLM and RL folks! 📢 No good RL algorithm for credit assignment for multi-turn LLM agents on reasoning-heavy tasks? Do not even have a good benchmark for studying it?

In SWEET-RL, we give you both (a vibe coding benchmark and SWEET algorithm). A thread 🧵(1/n)
Alex Gurung (@alexaag1234)

Preprint: Can we learn to reason for story generation (~100k tokens), without reward models?

Yes! We introduce an RLVR-inspired reward paradigm VR-CLI that correlates with human judgements of quality on the 'novel' task of Next-Chapter Prediction.

Paper: arxiv.org/abs/2503.22828
AK (@_akhaliq)

Deepseek just announced Inference-Time Scaling for Generalist Reward Modeling on Hugging Face

show that SPCT significantly improves the quality and scalability of GRMs, outperforming existing methods and models in various RM benchmarks without severe biases, and could achieve
Michael Luo (@michaelzluo)

🚀 We introduce DeepCoder-14B-Preview, a fully open-sourced coding model that is on par with o3-mini and o1!

📷 We scaled our model with RL magic up to 32K context. Its performance scales to 64K context 🔥
Naman Jain @ ICLR (@stringchaos)

Excited to release R2E-Gym
  - 🔥 8.1K executable environments using synthetic data
  - 🧠 Hybrid verifiers for enhanced inference-time scaling
  - 📈 51% success rate on SWE-Bench Verified
  - 🤗 Open Source Data + Models + Trajectories

1/
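
On the "hybrid verifiers" bullet above, a hedged sketch of how best-of-N selection with two verifier signals might look: blend an execution-based check (running tests) with an execution-free learned score, then submit the top-ranked patch. The signals, the linear blend, and the names (select_patch, exec_score, judge_score, alpha) are assumptions for illustration, not R2E-Gym's actual recipe.

# Hedged sketch: best-of-N patch selection with a "hybrid" verifier.
# exec_score() and judge_score() are hypothetical callables returning scores in [0, 1];
# the weighted blend below is an assumption, not R2E-Gym's published method.
def select_patch(candidates, exec_score, judge_score, alpha=0.5):
    """Rank candidate patches by a blend of execution-based and learned
    verifier scores, then return the top-ranked patch."""
    def hybrid(patch):
        return alpha * exec_score(patch) + (1 - alpha) * judge_score(patch)
    return max(candidates, key=hybrid)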
Prime Intellect (@primeintellect)

Today we’re launching INTELLECT-2: The first decentralized 32B-parameter RL training run open to join for anyone with compute — fully permissionless. Scaling towards frontier reasoning across coding, math and science.

Xeophon (@thexeophon)

the vLLM vs SGLang beef is the weirdest (and saddest) thing ever. Both are under the Linux Foundation; they could join forces and make the best inference framework ever :/

Brandon Trabucco @ ICLR (@brandontrabucco)

🌏 Building web-scale agents, and tired of Math and Coding tasks? Come chat with us at ICLR in Singapore. We are presenting InSTA at the DATA-FM workshop in the second Oral session, April 28th 2:30pm. InSTA is the largest environment for training agents, spanning 150k live

Ahmad Beirami @ ICLR 2025 (@abeirami)

As we go through a lot of excitement about RL recently with lots of cool work/results, here is a reminder that RL with a reverse KL-regularizer to the base model cannot learn new skills that were not already present in the base model. It can only amplify the existing weak skills.
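
A minimal sketch of the standard argument behind this claim (the derivation below is the textbook KL-regularized RL result, not quoted from the tweet): the objective

\max_{\pi}\; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[r(x,y)\big] \;-\; \beta\,\mathrm{KL}\big(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)

has the closed-form optimum

\pi^{*}(y \mid x) \;=\; \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,\exp\big(r(x,y)/\beta\big),

so \pi^{*}(y \mid x) = 0 wherever \pi_{\mathrm{ref}}(y \mid x) = 0. The optimal policy can only reweight (amplify or suppress) responses the base model already assigns nonzero probability; it cannot introduce new ones.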
Fahim Tajwar (@fahimtajwar10)

RL with verifiable reward has shown impressive results in improving LLM reasoning, but what can we do when we do not have ground truth answers?

Introducing Self-Rewarding Training (SRT): where language models provide their own reward for RL training!

🧵 1/n
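
The tweet doesn't spell out the reward mechanism, so here is only a hedged sketch of one natural way a model can reward itself: majority voting over its own samples as a pseudo-label. The helper names (generate, extract_answer) are hypothetical stand-ins, and whether SRT uses exactly this signal is an assumption.

# Hypothetical sketch: self-rewarding RL data via majority vote over the model's own samples.
# generate() and extract_answer() are stand-ins for a real sampling/parsing pipeline.
from collections import Counter

def self_reward(prompt, generate, extract_answer, n_samples=8):
    """Sample n completions, take the majority answer as a pseudo-label,
    and reward each completion by agreement with that pseudo-label."""
    completions = [generate(prompt) for _ in range(n_samples)]
    answers = [extract_answer(c) for c in completions]
    majority_answer, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == majority_answer else 0.0 for a in answers]
    return completions, rewards  # rewards can feed any RL trainer (e.g. PPO/GRPO)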
Manish Shetty (@slimshetty_)

✨ NEW SWE-Agents BENCHMARK ✨

Introducing GSO: The Global Software Optimization Benchmark
 - 👩🏻‍💻 100+ challenging software optimization tasks
 - 🛣️ a long-horizon task w/ precise specification
 - 🐘 large code changes in Py, C, C++, ...
 - 📉 SOTA models get < 5% success!

1/
Agentica Project (@agentica_)

It's easy to confuse Best@K vs Pass@K—and we've seen some misconceptions about our results.  

Our 59% on SWEBench-Verified is Pass@1 with Best@16, not Pass@8/16. Our Pass@8/16 is 67%/71%.  

So how did we achieve this? 

DeepSWE generates N candidate solutions. Then, another LLM
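
To make the metric distinction concrete, a small sketch of my reading of the tweet (not Agentica's eval code; passes() and verifier_score() are hypothetical stand-ins):

# Pass@K: oracle selection. Credit is given if ANY of the K candidates passes the tests.
def pass_at_k(candidates, passes):
    return float(any(passes(c) for c in candidates))

# Best@K: a verifier picks ONE candidate and only that one is checked, so the
# reported number is still a single submitted answer (Pass@1 of the selection).
def best_at_k(candidates, passes, verifier_score):
    chosen = max(candidates, key=verifier_score)
    return float(passes(chosen))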
Michael Luo (@michaelzluo)

We've noticed that quite a lot of sources claim credit from one-off pipelining, which originated from our work DeepCoder.

Not only SemiAnalysis' Dylan Patel but also bigger companies, such as Meta in its LLaMA RL paper (see Figure 2), claim credit while refusing to cite us.
Michael Luo (@michaelzluo)

🔮 The future is AGENTS for all applications. In the first 6 months we perfected RL for verifiable‑reward reasoning—single step chain‑of‑thought, deterministic answers. Now, the next years belong to multi‑agent systems—multiple steps (does not need thought), multiple agents