Sijun Tan (@sijun_tan)'s Twitter Profile
Sijun Tan

@sijun_tan

CS PhD @BerkeleySky | Scaling AI Agents @Agentica_ | Prev: @AIatMeta @Antgroup

ID: 938526574072262656

Link: http://jeffreysijuntan.com | Joined: 06-12-2017 21:52:27

117 Tweets

1.1K Followers

320 Following

Sijun Tan (@sijun_tan):

Check out Michael Luo's latest work, Autellix—an ultra-fast system for serving agentic workloads, achieving 4-15x speedups over vLLM/SGLang! At Agentica, we are committed to building efficient infra for serving/training of LLM agents, and Autellix is the first step towards it!

Sijun Tan (@sijun_tan):

Exciting to see the community successfully reproducing DeepScaleR’s results—this is the true power of open-source! By sharing everything openly, we enable faster progress and collective innovation. Let's build together!!

Sijun Tan (@sijun_tan):

Quoting this legendary Apple Ad that I think best encapsulates the spirit of Agency: "Here's to the crazy ones. The misfits. The rebels. The troublemakers. The round pegs in the square holes. The ones who see things differently. They're not fond of rules. And they have no respect for the status quo."

Michael Luo (@michaelzluo):

🚀 We introduce DeepCoder-14B-Preview, a fully open-sourced coding model that is on par with o3-mini and o1!

📷 We scaled our model with RL magic up to 32K context. Its performance scales to 64K context 🔥

Sijun Tan (@sijun_tan):

Hey Sam Altman, we know you're planning to open-source your reasoning model—but we couldn’t wait. Introducing DeepCoder-14B-Preview: a fully open-source reasoning model that matches o1 and o3-mini on both coding and math. And yes, we’re releasing everything: model, data, code, and

Chan Kha Vu 🇺🇦🌻🚜 (@chankhavu):

Babe, wake up, we have o3-mini at home 🤯 And, as usual, a Notion post instead of ArXiv paper. Peak Alpha energy 🫡 There are so many good engineering bits in this report 😍

Naman Jain @ ICLR (@stringchaos):

Excited to release R2E-Gym
  - 🔥 8.1K executable environments using synthetic data
  - 🧠 Hybrid verifiers for enhanced inference-time scaling
  - 📈 51% success rate on SWE-Bench Verified
  - 🤗 Open Source Data + Models + Trajectories

1/
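
The "hybrid verifiers" bullet in the tweet above refers to combining an execution-free judge with execution-based signals from generated tests. A minimal Python sketch of that idea; the linear blend and both inputs (model_score, test_results) are illustrative assumptions, not the R2E-Gym paper's exact recipe:

```python
def hybrid_verifier_score(model_score, test_results, alpha=0.5):
    """Blend an execution-free score (e.g., a learned judge's
    estimate in [0, 1] that a candidate patch is correct) with an
    execution-based score (fraction of generated, non-ground-truth
    tests the patch passes). alpha and the linear blend are assumed
    for illustration."""
    exec_score = sum(test_results) / max(len(test_results), 1)
    return alpha * model_score + (1.0 - alpha) * exec_score

# A Best@N-style pipeline scores all N candidates and submits the
# verifier's single top pick.
def pick_best(scored_candidates):
    return max(scored_candidates, key=lambda c: c["score"])
```
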
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) (@teortaxestex):

I keep saying that DeepScaleR is among the most impressive branches on the R1 tree, maybe the best one. ByteDance would do well to release a small Seed-Thinking for comparison, though

Sijun Tan (@sijun_tan):

This is a great blog post to read. RL finally works—not because of major advances in algorithms, but because we now have strong pretrained models that provide a good prior. From there, we finetune the model to adapt to different kinds of environments. The upper bound of the

Sijun Tan (@sijun_tan):

If you’re at ICLR, swing by Session 4 today and check out our work JudgeBench! It’s a benchmark for evaluating LLM judges on their ability to distinguish between challenging reasoning responses. Not many reasoning papers made it to ICLR this year—o1 had just dropped around the

Sijun Tan (@sijun_tan):

Very interesting finding! RL can even work with incorrect rewards because the GRPO clipping term introduces a bias toward optimizing high-probability tokens, leading to a more concentrated distribution around them. This essentially means that if your model has learned a strong
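
For context on the mechanism this tweet describes, here is a minimal numpy sketch of the PPO-style clipped surrogate that GRPO optimizes, with GRPO's group-normalized advantages (function names are illustrative). The min/clip pair is the "clipping term": once a token's probability ratio leaves the trust region in the favorable direction its gradient is zeroed, so updates keep concentrating mass on tokens the policy already ranks highly.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    # GRPO uses no value network: each rollout's advantage is its
    # reward normalized within the group sampled for the same prompt.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    # Per-token probability ratio between current and behavior policy.
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # The elementwise min stops rewarding ratio growth beyond 1+eps
    # (and stops penalizing shrinkage below 1-eps), which biases
    # updates toward tokens that are already high-probability.
    return np.minimum(unclipped, clipped).mean()  # objective to maximize
```
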

Yi Wu (@jxwuyi):

We release fully async RL system AReaL-boba² for LLM & SOTA code RL w. Qwen3-14B! @Alibaba_Qwen #opensource
🚀 system & algorithm co-design → 2.77x faster
✅ 69.1 on LiveCodeBench 
🔥 multi-turn RL ready
🔗 Project: github.com/inclusionAI/AR…
📄 Paper: arxiv.org/pdf/2505.24298
1/3👇
Sijun Tan (@sijun_tan):

The first half of 2025 is all about reasoning models. The second half? It's about agents. At Agentica, we're thrilled to launch two major releases: 1. DeepSWE, our SOTA coding agent trained with RL that tops the SWE-Bench leaderboard for open-weight models. 2. rLLM, our agent

Jaskirat Singh (@1jaskiratsingh):

How much can we scale long-context multi-step agents using only RL? Short answer: quite a lot, given good training environments and a scalable RL recipe. 🚨 We introduce DeepSWE-Preview, a reasoning-enabled coding agent trained from scratch from Qwen3-32B with only reinforcement

Agentica Project (@agentica_):

It's easy to confuse Best@K vs Pass@K—and we've seen some misconceptions about our results.  

Our 59% on SWEBench-Verified is Pass@1 with Best@16, not Pass@8/16. Our Pass@8/16 is 67%/71%.  

So how did we achieve this? 

DeepSWE generates N candidate solutions. Then, another LLM
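
The tweet is cut off, but the mechanism it sketches (sample N candidates, let a verifier commit to one) is enough to state the two metrics precisely. A schematic Python sketch; verifier_score and the passes list are hypothetical stand-ins, not DeepSWE's actual components:

```python
def pass_at_k(passes, k):
    """Oracle metric: did ANY of the first k candidates pass the
    ground-truth tests? `passes` holds one boolean per candidate."""
    return any(passes[:k])

def best_at_k(candidates, verifier_score, passes, k):
    """Deployable metric: a verifier with no access to ground-truth
    tests scores the first k candidates and commits to ONE; only
    that single pick is then checked against the hidden tests."""
    pick = max(range(k), key=lambda i: verifier_score(candidates[i]))
    return passes[pick]
```

Because Best@K passes only when its single pick passes, it can never exceed Pass@K on the same candidates, which is why 59% Best@16 is consistent with 67%/71% Pass@8/16.
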
Sijun Tan (@sijun_tan):

We've seen some misconceptions about our results. DeepSWE reports Best@K, not Pass@K, and they are very different! This post explains everything:

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) (@teortaxestex):

Important correction on DeepSWE-Preview. On SWE-Bench-Verified:
Pass@1 = 42.2%
"Best@8" = 59%, trajectory selection achieved with a hybrid (execution-free + test-based, as in the R2E-Gym paper) verifier. I.e., the system itself can yield 59% w/o a ground-truth check.
Actual Pass@8 = 67%.
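
As a footnote on the Pass@K figures above: such numbers are typically computed with the unbiased estimator from the Codex paper (Chen et al., 2021) rather than by literally drawing k of the n runs. A minimal version, assuming that convention applies here:

```python
from math import comb

def pass_at_k(n, c, k):
    # Unbiased pass@k estimate given n samples per task, of which c
    # passed the ground-truth tests (Chen et al., 2021):
    #   pass@k = 1 - C(n - c, k) / C(n, k)
    if n - c < k:
        return 1.0  # every size-k subset contains a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```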