Michael Luo (@michaelzluo)'s Twitter Profile
Michael Luo

@michaelzluo

CS PhD at UC Berkeley @berkeley_ai, Project Lead of @agentica_

ID: 1671255777565347840

Link: https://michaelzhiluo.github.io · Joined: 20-06-2023 20:36:57

152 Tweets

358 Followers

195 Following

Karan Dalal (@karansdalal)

I’m excited to share a project I’ve been working on for over a year, which I believe will fundamentally change our approach to language models.

We’ve designed a new architecture, which replaces the hidden state of an RNN with a machine learning model. This model compresses
AmbiRobotics (@ambirobotics)

This week Encord hosted AI After Hours at GitHub HQ and our Foundation Model Lead, Vishal Satish, shared how Ambi Robotics is leveraging 200K+ hours of high-fidelity production data to train PRIME-1—a domain-expert foundation model designed for industrial reliability.
Yifei Zhou (@yifeizhou02)

📢LLM and RL folks! 📢 No good RL algorithm for credit assignment for multi-turn LLM agents on reasoning-heavy tasks? Do not even have a good benchmark for studying it?

In SWEET-RL, we give you both (a vibe coding benchmark and SWEET algorithm). A thread 🧵(1/n)
Alex Gurung (@alexaag1234)

Preprint: Can we learn to reason for story generation (~100k tokens), without reward models?

Yes! We introduce an RLVR-inspired reward paradigm VR-CLI that correlates with human judgements of quality on the 'novel' task of Next-Chapter Prediction.

Paper: arxiv.org/abs/2503.22828
AK (@_akhaliq)

Deepseek just announced Inference-Time Scaling for Generalist Reward Modeling on Hugging Face

show that SPCT significantly improves the quality and scalability of GRMs, outperforming existing methods and models in various RM benchmarks without severe biases, and could achieve
Michael Luo (@michaelzluo)

🚀 We introduce DeepCoder-14B-Preview, a fully open-sourced coding model that is on par with o3-mini and o1!

📷 We scaled our model with RL magic up to 32K context. Its performance scales to 64K context 🔥
Naman Jain @ ICLR (@stringchaos)

Excited to release R2E-Gym
  - 🔥 8.1K executable environments using synthetic data
  - 🧠 Hybrid verifiers for enhanced inference-time scaling
  - 📈 51% success rate on SWE-Bench Verified
  - 🤗 Open Source Data + Models + Trajectories

1/
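
On the "hybrid verifiers" bullet above, a hedged sketch of how best-of-N selection with two verifier signals might look: blend an execution-based check (running tests) with an execution-free learned score, then submit the top-ranked patch. The signals, the linear blend, and the names (select_patch, exec_score, judge_score, alpha) are assumptions for illustration, not R2E-Gym's actual recipe.

# Hedged sketch: best-of-N patch selection with a "hybrid" verifier.
# exec_score() and judge_score() are hypothetical callables returning scores in [0, 1];
# the weighted blend below is an assumption, not R2E-Gym's published method.
def select_patch(candidates, exec_score, judge_score, alpha=0.5):
    """Rank candidate patches by a blend of execution-based and learned
    verifier scores, then return the top-ranked patch."""
    def hybrid(patch):
        return alpha * exec_score(patch) + (1 - alpha) * judge_score(patch)
    return max(candidates, key=hybrid)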
Prime Intellect (@primeintellect)

Today we’re launching INTELLECT-2: The first decentralized 32B-parameter RL training run open to join for anyone with compute — fully permissionless. Scaling towards frontier reasoning across coding, math and science.

Xeophon (@thexeophon)

the vLLM vs SGLang beef is the weirdest (and saddest) thing ever. Both are under the Linux Foundation; they could join forces and make the best inference framework ever :/

Brandon Trabucco @ ICLR (@brandontrabucco)

🌏 Building web-scale agents, and tired of Math and Coding tasks? Come chat with us at ICLR in Singapore. We are presenting InSTA at the DATA-FM workshop in the second Oral session, April 28th 2:30pm. InSTA is the largest environment for training agents, spanning 150k live

Ahmad Beirami @ ICLR 2025 (@abeirami)

As we go through a lot of excitement about RL recently with lots of cool work/results, here is a reminder that RL with a reverse KL-regularizer to the base model cannot learn new skills that were not already present in the base model. It can only amplify the existing weak skills.
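
A minimal sketch of the standard argument behind this claim (the derivation below is the textbook KL-regularized RL result, not quoted from the tweet): the objective

\max_{\pi}\; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[r(x,y)\big] \;-\; \beta\,\mathrm{KL}\big(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)

has the closed-form optimum

\pi^{*}(y \mid x) \;=\; \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,\exp\big(r(x,y)/\beta\big),

so \pi^{*}(y \mid x) = 0 wherever \pi_{\mathrm{ref}}(y \mid x) = 0. The optimal policy can only reweight (amplify or suppress) responses the base model already assigns nonzero probability; it cannot introduce new ones.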
Fahim Tajwar (@fahimtajwar10)

RL with verifiable reward has shown impressive results in improving LLM reasoning, but what can we do when we do not have ground truth answers?

Introducing Self-Rewarding Training (SRT): where language models provide their own reward for RL training!

🧵 1/n
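
The tweet doesn't spell out the reward mechanism, so here is only a hedged sketch of one natural way a model can reward itself: majority voting over its own samples as a pseudo-label. The helper names (generate, extract_answer) are hypothetical stand-ins, and whether SRT uses exactly this signal is an assumption.

# Hypothetical sketch: self-rewarding RL data via majority vote over the model's own samples.
# generate() and extract_answer() are stand-ins for a real sampling/parsing pipeline.
from collections import Counter

def self_reward(prompt, generate, extract_answer, n_samples=8):
    """Sample n completions, take the majority answer as a pseudo-label,
    and reward each completion by agreement with that pseudo-label."""
    completions = [generate(prompt) for _ in range(n_samples)]
    answers = [extract_answer(c) for c in completions]
    majority_answer, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == majority_answer else 0.0 for a in answers]
    return completions, rewards  # rewards can feed any RL trainer (e.g. PPO/GRPO)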
Manish Shetty (@slimshetty_)

✨ NEW SWE-Agents BENCHMARK ✨

Introducing GSO: The Global Software Optimization Benchmark
 - 👩🏻‍💻 100+ challenging software optimization tasks
 - 🛣️ a long-horizon task w/ precise specification
 - 🐘 large code changes in Py, C, C++, ...
 - 📉 SOTA models get < 5% success!

1/
Agentica Project (@agentica_)

It's easy to confuse Best@K vs Pass@K—and we've seen some misconceptions about our results.  

Our 59% on SWEBench-Verified is Pass@1 with Best@16, not Pass@8/16. Our Pass@8/16 is 67%/71%.  

So how did we achieve this? 

DeepSWE generates N candidate solutions. Then, another LLM
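
To make the metric distinction concrete, a small sketch of my reading of the tweet (not Agentica's eval code; passes() and verifier_score() are hypothetical stand-ins):

# Pass@K: oracle selection. Credit is given if ANY of the K candidates passes the tests.
def pass_at_k(candidates, passes):
    return float(any(passes(c) for c in candidates))

# Best@K: a verifier picks ONE candidate and only that one is checked, so the
# reported number is still a single submitted answer (Pass@1 of the selection).
def best_at_k(candidates, passes, verifier_score):
    chosen = max(candidates, key=verifier_score)
    return float(passes(chosen))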
Michael Luo (@michaelzluo)

We've noticed that quite a lot of sources claim credit from one-off pipelining, which originated from our work DeepCoder.

Not only SemiAnalysis' Dylan Patel but also bigger companies, such as Meta in its LLaMA RL paper (see Figure 2), claim credit while refusing to cite us.
Michael Luo (@michaelzluo)

🔮 The future is AGENTS for all applications. In the first 6 months we perfected RL for verifiable‑reward reasoning—single step chain‑of‑thought, deterministic answers. Now, the next years belong to multi‑agent systems—multiple steps (does not need thought), multiple agents