Ion Stoica (@istoica05) Twitter Tweets • TwiCopy

Melissa Pan

6 months ago

🚨 Why Do Multi-Agent LLM Systems Fail? ⁉️ 🔥 Introducing MAST: The first multi-agent failure taxonomy - consists of 14 failure modes and 3 categories, generalizes for diverse multi-agent systems and tasks! Paper: arxiv.org/pdf/2503.13657 Code: github.com/multi-agent-sy… 🧵1/n

Likes: 186 · Replies: 4 · Reposts: 54

Ion Stoica

@istoica05

6 months ago

This journey has been a blast, and I'm very much looking forward to an exciting future, driven by our incredible community.

Likes: 99 · Replies: 3 · Reposts: 6

SkyPilot

@skypilot_org

6 months ago

What a night! Huge thanks to everyone who came out to our first SkyPilot meetup, a packed house of builders and insightful convos. 💥 Thanks to all speakers (Sisil Mehta of Abridge, Woosuk Kwon of vLLM, Ion Stoica, et al.) for sharing SkyPilot use cases, and Anyscale

Likes: 22 · Replies: 1 · Reposts: 6

lmarena.ai (formerly lmsys.org)

@lmarena_ai

6 months ago

We thank the authors for their feedback. However, there are a number of factual errors and misleading statements in this writeup: Regarding the statement that some model providers are not treated fairly: - This is not true. Given our capacity, we have always tried to honor all

Likes: 260 · Replies: 17 · Reposts: 32

Ion Stoica

@istoica05

6 months ago

🙏

Likes: 1 · Replies: 0 · Reposts: 0

Lakshya A Agrawal

@lakshyaaagrawal

6 months ago

Real world AI pipelines are often compound, multi-module, and multi-step programs—unlike most RL/GRPO implementations today which optimize a single agent. 🚨 Super excited to release dspy.GRPO, which lets you GRPO tune any arbitrary multi-module, multi-step DSPy program, with

Likes: 60 · Replies: 0 · Reposts: 15

Ali Ghodsi

@alighodsi

5 months ago

I am super excited to announce that we have agreed to acquire Neon, a developer-centric serverless Postgres company. The Neon team engineered a new database architecture that offers speed, elastic scaling, and branching and forking. The capabilities that make Neon great for

Likes: 630 · Replies: 18 · Reposts: 74

Percy Liang

@percyliang

5 months ago

What would truly open-source AI look like? Not just open weights, open code/data, but *open development*, where the entire research and development process is public *and* anyone can contribute. We built Marin, an open lab, to fulfill this vision:

Likes: 939 · Replies: 39 · Reposts: 185

NovaSky

@novaskyai

5 months ago

1/N Introducing SkyRL-SQL, a simple, data-efficient RL pipeline for Text-to-SQL that trains LLMs to interactively probe, refine, and verify SQL queries with a real database. 🚀 Early Result: trained on just ~600 samples, SkyRL-SQL-7B outperforms GPT-4o, o4-mini, and SFT model

Likes: 136 · Replies: 3 · Reposts: 27

Sumanth Hegde

@sumanthrh

5 months ago

Some of our interesting observations from working on multi-turn text2SQL: - Data-efficient RL works pretty well: We did very typical GRPO settings; Just make sure to use "hard-enough" samples and no KL. KL can stabilize learning early on but will always bring down rewards

Likes: 17 · Replies: 0 · Reposts: 5

Andy Konwinski

@andykonwinski

5 months ago

If you had 15min to tell thousands of Berkeley CS/Data/Stats grads what to do with their lives, what would you say? Last Thursday I told them to RUN AT FAILURE. Afterwards, while we were shaking hands & taking selfies, hundreds of them told me that they are excited to go fail. I

Likes: 252 · Replies: 17 · Reposts: 28

Manish Shetty

@slimshetty_

5 months ago

✨ NEW SWE-Agents BENCHMARK ✨ Introducing GSO: The Global Software Optimization Benchmark - 👩🏻‍💻 100+ challenging software optimization tasks - 🛣️ a long-horizon task w/ precise specification - 🐘 large code changes in Py, C, C++, ... - 📉 SOTA models get < 5% success! 1/

Likes: 121 · Replies: 6 · Reposts: 26

Robert Nishihara

@robertnishihara

5 months ago

The AI compute software stack consists of 3 specialized layers: 🔧🔧🔧 Layer 1: Training & Inference Framework (PyTorch + vLLM) • Runs models efficiently on GPUs • Handles model optimization and model parallelism strategies • Manages accelerator memory and automatic

Likes: 63 · Replies: 1 · Reposts: 20

uccl_project

@uccl_proj

5 months ago

1/N 📢 Introducing UCCL (Ultra & Unified CCL), an efficient collective communication library for ML training and inference, outperforming NCCL by up to 2.5x 🚀 Code: github.com/uccl-project/u… Blog: uccl-project.github.io/posts/about-uc… Results: AllReduce on 6 HGX across 2 racks over RoCE RDMA

Likes: 31 · Replies: 1 · Reposts: 13

Hao AI Lab

@haoailab

4 months ago

[Lmgame Bench] o3-pro: A Milestone in LLM Gaming! 🕹️ The leap from o3 to o3-pro is bigger than you might have thought. We tested o3-pro on Tetris and Sokoban— achieved SOTA on both and outperformed its previous self by a big margin. 🔍 🧱 Tetris Update o3-pro: ✅ 8+ lines

Likes: 558 · Replies: 12 · Reposts: 110

Ion Stoica

@istoica05

4 months ago

Taking a step towards building a modular RL framework with our SkyRL project.

Likes: 60 · Replies: 4 · Reposts: 13

Agentica Project

@agentica_

4 months ago

🚀 Introducing DeepSWE 🤖: our fully open-sourced, SOTA software engineering agent trained purely with RL on top of Qwen3-32B. DeepSWE achieves 59% on SWEBench-Verified with test-time scaling (and 42.2% Pass@1), topping the SWEBench leaderboard for open-weight models. 💪DeepSWE

Likes: 345 · Replies: 15 · Reposts: 65

Daniel Kang

@daniel_d_kang

4 months ago

As AI agents near real-world use, how do we know what they can actually do? Reliable benchmarks are critical but agentic benchmarks are broken! Example: WebArena marks "45+8 minutes" on a duration calculation task as correct (real answer: "63 minutes"). Other benchmarks

Likes: 47 · Replies: 1 · Reposts: 17

Robert Nishihara

@robertnishihara

3 months ago

Congratulations to my brilliant co-founder Philipp Moritz (@pcmoritz) and the legendary John Schulman, Sergey Levine, Pieter Abbeel, and Michael Jordan on their Test-of-Time Honorable Mention at ICML 2025 today! For creating TRPO. This was done during the previous wave of

Likes: 132 · Replies: 1 · Reposts: 10

martin_casado

@martin_casado

3 months ago

Remarkable how far we've come. From fear mongering OS AI across VC, academia, AI labs, and politicians to full throated endorsement. Thank you to everyone who has taken a stand on this over the last couple of years. In no small way have you helped sway and save a nation.

Likes: 378 · Replies: 16 · Reposts: 67