Yifei Zhou (@yifeizhou02)'s Twitter Profile
Yifei Zhou

@yifeizhou02

Visiting researcher @AIatMeta | PhD student @berkeley_ai, working on RL for foundation models

ID: 1547790583334285313

Link: http://yifeizhou02.github.io | Joined: 15-07-2022 03:50:20

217 Tweets

1.1K Followers

445 Following

Seohong Park (@seohong_park)

Is RL really scalable like other objectives? We found that just scaling up data and compute is *not* enough to enable RL to solve complex tasks. The culprit is the horizon. Paper: arxiv.org/abs/2506.04168 Thread ↓

Aviral Kumar (@aviral_kumar2)

Can offline RL methods do well on any problem, as we scale compute and data? In our new paper led by Seohong Park, we show that task horizon can fundamentally hinder scaling for offline RL, and how explicitly reducing task horizon can address this. arxiv.org/abs/2506.04168 🧵⬇️
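To make the horizon point concrete, here is a minimal sketch (Python/PyTorch, not the paper's method or code) of one standard horizon-reduction trick: replacing 1-step TD backups with n-step backups, so bootstrapped value information only has to propagate across roughly H/n backups instead of H. The function name, batch shapes, and the discrete-action `target_q` network are illustrative assumptions, not the paper's setup.

```python
import torch

def n_step_td_targets(rewards, next_obs, dones, target_q, gamma=0.99, n=5):
    """Sketch of n-step TD targets for a discrete-action Q-network.
    rewards:  (B, n) rewards along an n-step segment of an offline trajectory.
    dones:    (B, n) termination flags for each step in the segment.
    next_obs: (B, obs_dim) observation reached after the n-th step."""
    B = rewards.shape[0]
    target = torch.zeros(B)
    alive = torch.ones(B)      # stays 1 until the episode terminates inside the segment
    discount = 1.0
    for t in range(n):
        target = target + discount * alive * rewards[:, t]
        alive = alive * (1.0 - dones[:, t])
        discount *= gamma
    with torch.no_grad():      # bootstrap from the target network n steps ahead
        bootstrap = target_q(next_obs).max(dim=-1).values
    return target + discount * alive * bootstrap
```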

Aviral Kumar (@aviral_kumar2)

Looking back, some of the most effective methods that we've built for training LLM/VLM agents in multi-turn settings also *needed* to utilize such a hierarchical structure, e.g., ArCHer (yifeizhou02.github.io/archer.io/) by Yifei Zhou, further showing the promise behind such ideas.

elvis (@omarsar0)

Self-Challenging LLM Agents: Self-improving AI systems are starting to show up everywhere. Meta and colleagues present self-improvement for general multi-turn tool-use LLM agents. Pay attention to this one, devs! Here are my notes:

Sergey Levine (@svlevine)

I always found it puzzling how language models learn so much from next-token prediction, while video models learn so little from next frame prediction. Maybe it's because LLMs are actually brain scanners in disguise. Idle musings in my new blog post: sergeylevine.substack.com/p/language-mod…

Yifei Zhou (@yifeizhou02)

In this paper we explore how we can efficiently scale inference-time compute for agents. Instead of blindly scaling the number of tokens at each step, it would be much better to scale the number of interactions! Check out how we did it!

Yifei Zhou (@yifeizhou02)

It's been a really fun time working on this project. It turns out that acting longer is the best axis for scaling inference compute for agents, but figuring out the best way to do it is tricky!

Kevin Frans (@kvfrans)

Very excited for this one. We took a cautiously experimental view on NN optimizers, aiming to find something that just works. SPlus matches Adam within ~44% of steps on a range of objectives. Please try it out in your setting, or read below for how it works.

Aviral Kumar (@aviral_kumar2)

A lot of work on agents these days uses reasoning RL to train agents. But is that good enough? Jack Bai & Junhong Shen show that it's not: we also want RL to learn *how* to explore and *discover* novel behaviors, by scaling "in-context" interaction!

Jyo Pari (@jyo_pari)

What if an LLM could update its own weights? Meet SEAL🦭: a framework where LLMs generate their own training data (self-edits) to update their weights in response to new inputs. Self-editing is learned via RL, using the updated model’s downstream performance as reward.
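As a rough illustration of the loop described above (a hedged sketch, not the SEAL authors' code), the outer RL loop might look like the following. `generate_self_edits`, `finetune_on`, `evaluate_downstream`, and `reinforce_update` are hypothetical placeholders for the generation, fine-tuning, evaluation, and policy-gradient machinery.

```python
# High-level sketch of a SEAL-style loop (hypothetical helpers, not the authors' code).
def seal_style_loop(model, tasks, num_rounds=10):
    for _ in range(num_rounds):
        trajectories = []
        for task in tasks:
            # 1. The model writes its own training data ("self-edits") for this input.
            self_edits = generate_self_edits(model, task.context)
            # 2. Apply the self-edits as a weight update, e.g. a short fine-tuning pass.
            updated_model = finetune_on(model, self_edits)
            # 3. The reward is the *updated* model's downstream performance on the task.
            reward = evaluate_downstream(updated_model, task.eval_set)
            trajectories.append((task.context, self_edits, reward))
        # 4. RL step: make self-edits that led to better post-update performance more likely.
        model = reinforce_update(model, trajectories)
    return model
```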

WebAgentlab (@webagentlab)

🗞️ Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction. TTI (Test-Time Interaction) is a curriculum-based online reinforcement learning (RL) approach that trains agents by adaptively adjusting their rollout lengths. Using a Gemma 3 12B model, TTI produces…
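A rough sketch of the curriculum idea (assumptions only, not the TTI implementation): run online RL while gradually growing the interaction budget the agent is allowed per episode. `env`, `agent`, and `rl_update` are hypothetical stand-ins, and the fixed doubling schedule here is a simplification of the paper's adaptive adjustment.

```python
# Sketch: online RL with a curriculum over the allowed rollout length.
def train_with_interaction_curriculum(agent, env, rl_update,
                                      start_steps=4, max_steps=32,
                                      iters=1000, grow_every=100):
    horizon = start_steps
    for it in range(iters):
        if it > 0 and it % grow_every == 0:
            horizon = min(max_steps, horizon * 2)   # lengthen the allowed rollout

        obs, rollout, done = env.reset(), [], False
        for _ in range(horizon):                    # act for at most `horizon` steps
            action = agent.act(obs)
            next_obs, reward, done = env.step(action)
            rollout.append((obs, action, reward))
            obs = next_obs
            if done:
                break
        agent = rl_update(agent, rollout)           # one online RL update per rollout
    return agent
```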

Seohong Park (@seohong_park)

Q-learning is not yet scalable: seohong.me/blog/q-learnin… I wrote a blog post about my thoughts on scalable RL algorithms. To be clear, I'm still highly optimistic about off-policy RL and Q-learning! I just think we haven't found the right solution yet (the post discusses why).

Amrith Setlur (@setlur_amrith)

Introducing e3 🔥 Best <2B model on math 💪 Are LLMs implementing algos ⚒️ or is thinking an illusion 🎩? Is RL only sharpening the base LLM distribution 🤔 or discovering novel strategies outside the base LLM 💡? We answer these ⤵️ 🚨 arxiv.org/abs/2506.09026 🚨 matthewyryang.github.io/e3/

Aviral Kumar (@aviral_kumar2)

Our view on test-time scaling has been to train models to discover algos that enable them to solve harder problems. Amrith Setlur & Matthew Yang's new work e3 shows how RL done with this view produces the best <2B LLM on math, one that extrapolates beyond its training budget. 🧵⬇️

Yifei Zhou (@yifeizhou02)

It's been the most dramatic mindset shift since I paused my PhD at Berkeley and joined xAI: taking part in the creation of the most intelligent AI model and the most efficient team, one that operates at theoretically optimal speed. History is rolling and there is nothing in the way.