Yifei Zhou (@yifeizhou02)'s Twitter Profile
Yifei Zhou

@yifeizhou02

Visiting researcher @AIatMeta | PhD student @berkeley_ai, working on RL for foundation models

ID: 1547790583334285313

Link: http://yifeizhou02.github.io | Joined: 15-07-2022 03:50:20

217 Tweets

1.1K Followers

445 Following

Seohong Park (@seohong_park)

Is RL really scalable like other objectives? We found that just scaling up data and compute is *not* enough to enable RL to solve complex tasks. The culprit is the horizon. Paper: arxiv.org/abs/2506.04168 Thread ↓

Aviral Kumar (@aviral_kumar2)

Can offline RL methods do well on any problem, as we scale compute and data? In our new paper led by Seohong Park, we show that task horizon can fundamentally hinder scaling for offline RL, and how explicitly reducing task horizon can address this. arxiv.org/abs/2506.04168 🧵⬇️
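To make the horizon point concrete, here is a minimal sketch (Python/PyTorch, not the paper's method or code) of one standard horizon-reduction trick: replacing 1-step TD backups with n-step backups, so bootstrapped value information only has to propagate across roughly H/n backups instead of H. The function name, batch shapes, and the discrete-action `target_q` network are illustrative assumptions, not the paper's setup.

```python
import torch

def n_step_td_targets(rewards, next_obs, dones, target_q, gamma=0.99, n=5):
    """Sketch of n-step TD targets for a discrete-action Q-network.
    rewards:  (B, n) rewards along an n-step segment of an offline trajectory.
    dones:    (B, n) termination flags for each step in the segment.
    next_obs: (B, obs_dim) observation reached after the n-th step."""
    B = rewards.shape[0]
    target = torch.zeros(B)
    alive = torch.ones(B)      # stays 1 until the episode terminates inside the segment
    discount = 1.0
    for t in range(n):
        target = target + discount * alive * rewards[:, t]
        alive = alive * (1.0 - dones[:, t])
        discount *= gamma
    with torch.no_grad():      # bootstrap from the target network n steps ahead
        bootstrap = target_q(next_obs).max(dim=-1).values
    return target + discount * alive * bootstrap
```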

Aviral Kumar (@aviral_kumar2)

Looking back, some of the most effective methods that we've built for training LLM/VLM agents in multi-turn settings also *needed* to utilize such a hierarchical structure, e.g., ArCHer (yifeizhou02.github.io/archer.io/) by Yifei Zhou, further showing the promise behind such ideas.

elvis (@omarsar0)

Self-Challenging LLM Agents: Self-improving AI systems are starting to show up everywhere. Meta and colleagues present self-improvement for general multi-turn tool-use LLM agents. Pay attention to this one, devs! Here are my notes:

Sergey Levine (@svlevine)

I always found it puzzling how language models learn so much from next-token prediction, while video models learn so little from next frame prediction. Maybe it's because LLMs are actually brain scanners in disguise. Idle musings in my new blog post: sergeylevine.substack.com/p/language-mod…

Yifei Zhou (@yifeizhou02)

In this paper we explore how we can efficiently scale inference-time compute for agents. Instead of blindly scaling the number of tokens at each step, it would be much better to scale the number of interactions! Check out how we did it!

Yifei Zhou (@yifeizhou02)

It's been a really fun time working on this project. It turns out that acting longer is the best axis for scaling inference compute for agents, but figuring out the best way to do it is tricky!

Kevin Frans (@kvfrans)

Very excited for this one. We took a cautiously experimental view on NN optimizers, aiming to find something that just works. SPlus matches Adam within ~44% of steps on a range of objectives. Please try it out in your setting, or read below for how it works.

Aviral Kumar (@aviral_kumar2)

A lot of work on agents these days uses reasoning RL to train agents. But is that good enough? Jack Bai & Junhong Shen show that it's not: we also want RL to learn *how* to explore and *discover* novel behaviors, by scaling "in-context" interaction!

Jyo Pari (@jyo_pari)

What if an LLM could update its own weights? Meet SEAL🦭: a framework where LLMs generate their own training data (self-edits) to update their weights in response to new inputs. Self-editing is learned via RL, using the updated model’s downstream performance as reward.
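As a rough illustration of the loop described above (a hedged sketch, not the SEAL authors' code), the outer RL loop might look like the following. `generate_self_edits`, `finetune_on`, `evaluate_downstream`, and `reinforce_update` are hypothetical placeholders for the generation, fine-tuning, evaluation, and policy-gradient machinery.

```python
# High-level sketch of a SEAL-style loop (hypothetical helpers, not the authors' code).
def seal_style_loop(model, tasks, num_rounds=10):
    for _ in range(num_rounds):
        trajectories = []
        for task in tasks:
            # 1. The model writes its own training data ("self-edits") for this input.
            self_edits = generate_self_edits(model, task.context)
            # 2. Apply the self-edits as a weight update, e.g. a short fine-tuning pass.
            updated_model = finetune_on(model, self_edits)
            # 3. The reward is the *updated* model's downstream performance on the task.
            reward = evaluate_downstream(updated_model, task.eval_set)
            trajectories.append((task.context, self_edits, reward))
        # 4. RL step: make self-edits that led to better post-update performance more likely.
        model = reinforce_update(model, trajectories)
    return model
```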

WebAgentlab (@webagentlab)

🗞️ Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction. TTI (Test-Time Interaction) is a curriculum-based online reinforcement learning (RL) approach that trains agents by adaptively adjusting their rollout lengths. Using a Gemma 3 12B model, TTI produces…
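A rough sketch of the curriculum idea (assumptions only, not the TTI implementation): run online RL while gradually growing the interaction budget the agent is allowed per episode. `env`, `agent`, and `rl_update` are hypothetical stand-ins, and the fixed doubling schedule here is a simplification of the paper's adaptive adjustment.

```python
# Sketch: online RL with a curriculum over the allowed rollout length.
def train_with_interaction_curriculum(agent, env, rl_update,
                                      start_steps=4, max_steps=32,
                                      iters=1000, grow_every=100):
    horizon = start_steps
    for it in range(iters):
        if it > 0 and it % grow_every == 0:
            horizon = min(max_steps, horizon * 2)   # lengthen the allowed rollout

        obs, rollout, done = env.reset(), [], False
        for _ in range(horizon):                    # act for at most `horizon` steps
            action = agent.act(obs)
            next_obs, reward, done = env.step(action)
            rollout.append((obs, action, reward))
            obs = next_obs
            if done:
                break
        agent = rl_update(agent, rollout)           # one online RL update per rollout
    return agent
```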

Seohong Park (@seohong_park)

Q-learning is not yet scalable: seohong.me/blog/q-learnin… I wrote a blog post about my thoughts on scalable RL algorithms. To be clear, I'm still highly optimistic about off-policy RL and Q-learning! I just think we haven't found the right solution yet (the post discusses why).

Amrith Setlur (@setlur_amrith)

Introducing e3 🔥 Best <2B model on math 💪 Are LLMs implementing algos ⚒️ or is thinking an illusion 🎩? Is RL only sharpening the base LLM distribution 🤔 or discovering novel strategies outside the base LLM 💡? We answer these ⤵️ 🚨 arxiv.org/abs/2506.09026 🚨 matthewyryang.github.io/e3/

Aviral Kumar (@aviral_kumar2)

Our view on test-time scaling has been to train models to discover algos that enable them to solve harder problems. Amrith Setlur & Matthew Yang's new work e3 shows how RL done with this view produces the best <2B LLM on math, one that extrapolates beyond its training budget. 🧵⬇️

Yifei Zhou (@yifeizhou02)

It's been the most dramatic mindset shift since I paused my PhD at Berkeley and joined xAI: taking part in the creation of the most intelligent AI model and the most efficient team, one that operates at theoretically optimal speed. History is rolling and there is nothing in the way.