Chongyi Zheng (@chongyiz1)'s Twitter Profile
Chongyi Zheng

@chongyiz1

PhD student @ Princeton working on RL.

ID: 1468793850139639809

Website: https://chongyi-zheng.github.io/ · Joined: 09-12-2021 04:05:23

48 Tweets

169 Followers

129 Following

Kevin Frans (@kvfrans):

Over the past year, I've been compiling some "alchemist's notes" on deep learning. Right now it covers basic optimization, architectures, and generative models.

Focus is on learnability -- each page has nice graphics and an end-to-end implementation.

notes.kvfrans.com
Younggyo Seo (@younggyoseo):

Excited to present FastTD3: a simple, fast, and capable off-policy RL algorithm for humanoid control -- with open-source code to run your own humanoid RL experiments in no time! Thread below 🧵

Kevin Frans (@kvfrans):

Stare at policy improvement and diffusion guidance, and you may notice a suspicious similarity...

We lay out an equivalence between the two, formalizing a simple technique (CFGRL) to improve performance across-the-board when training diffusion policies.

arxiv.org/abs/2505.23458
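
Background on the guidance side of this equivalence (standard classifier-free guidance, not a claim specific to the paper above): CFG combines an unconditional and a conditional noise prediction as

ε_w = ε(a_t | s) + w · (ε(a_t | s, c) − ε(a_t | s)),

which, at the score level, samples from a distribution proportional to π(a | s) · (π(a | s, c) / π(a | s))^w. In other words, the base policy is tilted toward the conditioned one, with the guidance weight w controlling how strongly; the paper formalizes when this tilt is a valid policy improvement step.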
Seohong Park (@seohong_park):

We found a way to do RL *only* with BC policies.

The idea is simple:

1. Train a BC policy π(a|s)
2. Train a conditional BC policy π(a|s, z)
3. Amplify(!) the difference between π(a|s, z) and π(a|s) using CFG

Here, z can be anything (e.g., goals for goal-conditioned RL).

🧵↓
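
A minimal sketch of step 3 at sampling time, assuming diffusion-style BC policies whose networks predict the injected noise. The network interfaces, action dimensionality, guidance weight, and noise schedule here are illustrative assumptions, not the paper's implementation:

```python
import torch

T = 100                                    # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 2e-2, T)      # linear noise schedule (assumed)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def sample_action(eps_uncond, eps_cond, s, z, act_dim, w=3.0):
    """eps_uncond(a_t, s, t): noise prediction of the BC policy pi(a|s).
    eps_cond(a_t, s, z, t): noise prediction of the conditional policy pi(a|s, z).
    w = 1 recovers plain conditional BC; w > 1 amplifies the difference."""
    a = torch.randn(s.shape[0], act_dim)               # start from pure noise
    for t in reversed(range(T)):
        e_u = eps_uncond(a, s, t)
        e_c = eps_cond(a, s, z, t)
        e = e_u + w * (e_c - e_u)                      # CFG: amplify the difference
        # standard DDPM reverse step using the guided noise estimate
        mean = (a - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * e) / torch.sqrt(alphas[t])
        noise = torch.randn_like(a) if t > 0 else torch.zeros_like(a)
        a = mean + torch.sqrt(betas[t]) * noise
    return a
```

With w = 1 this simply samples from π(a | s, z); w > 1 pushes actions further toward whatever the conditioning z selects for (e.g., reaching a goal).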
Seohong Park (@seohong_park):

Is RL really scalable like other objectives? We found that just scaling up data and compute is *not* enough to enable RL to solve complex tasks. The culprit is the horizon. Paper: arxiv.org/abs/2506.04168 Thread ↓

John Zhou (@johnlyzhou):

Hierarchical methods for offline goal-conditioned RL (GCRL) can scale to very distant goals that stymie flat (non-hierarchical) policies — but are they really necessary? Paper: arxiv.org/abs/2505.14975 Project page: johnlyzhou.github.io/saw/ Code: github.com/johnlyzhou/saw Thread ↓

Kevin Frans (@kvfrans):

Very excited for this one. We took a cautiously experimental view on NN optimizers, aiming to find something that just works. 

SPlus matches Adam's performance using roughly 44% of the training steps across a range of objectives. Please try it out in your setting, or read below for how it works.
Seohong Park (@seohong_park):

New paper on unsupervised pre-training for RL! The idea is to learn a flow-based future prediction model for each "intention" in the dataset. We can then use these models to estimate values for fine-tuning.

Sergey Levine (@svlevine):

Unsupervised RL with intention-conditioned models provides a really interesting combination of predictive modeling and counterfactual learning (i.e., control). Getting such methods to work at scale has always been a challenge, but it's getting closer!

Ben Eysenbach (@ben_eysenbach):

What makes RL hard is the _time_ axis⏳, so let's pre-train RL policies to learn about _time_! Same intuition as successor representations 🧠, but made scalable with modern GenAI models 🚀. Excited to share new work led by Chongyi Zheng, together with Seohong Park and Sergey Levine!
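
To unpack the successor-representation intuition in this cluster of tweets (standard background, not a result from the thread): for a state-dependent reward r(s⁺), the discounted future-state distribution

d^π_γ(s⁺ | s) = (1 − γ) · Σ_{t≥0} γ^t · Pr(s_t = s⁺ | s_0 = s, π)

determines values directly, since V^π(s) = (1 / (1 − γ)) · E_{s⁺ ~ d^π_γ(· | s)}[r(s⁺)]. A generative model of future states, such as the intention-conditioned flow model described above, is therefore enough to estimate values for fine-tuning.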

Seohong Park (@seohong_park):

Q-learning is not yet scalable

seohong.me/blog/q-learnin…

I wrote a blog post about my thoughts on scalable RL algorithms.

To be clear, I'm still highly optimistic about off-policy RL and Q-learning! I just think we haven't found the right solution yet (the post discusses why).
Qiyang Li (@qiyang_li):

Everyone knows action chunking is great for imitation learning. It turns out that we can extend its success to RL to better leverage prior data for improved exploration and online sample efficiency! colinqiyangli.github.io/qc/ The recipe to achieve this is incredibly simple. 🧵 1/N
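
For readers unfamiliar with the term, here is a minimal sketch of what action chunking looks like at rollout time, assuming a Gymnasium-style environment; this illustrates the concept only, not the paper's RL recipe:

```python
def rollout_chunked_policy(env, policy, episode_len=1000):
    """policy(obs) -> iterable of actions (one 'chunk').
    The whole chunk is executed open-loop before the policy is queried again.
    Assumes a Gymnasium-style env: reset() -> (obs, info), step() -> 5-tuple."""
    obs, _ = env.reset()
    total_reward, t = 0.0, 0
    while t < episode_len:
        chunk = policy(obs)                  # predict the next few actions at once
        for action in chunk:                 # execute them without re-planning
            obs, reward, terminated, truncated, _ = env.step(action)
            total_reward += reward
            t += 1
            if terminated or truncated or t >= episode_len:
                return total_reward
    return total_reward
```

The linked thread and project page describe how this chunked action space is combined with RL to better leverage prior data and improve exploration and online sample efficiency.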