Yuda Song @ ICLR 2025 (@yus167) 's Twitter Profile
Yuda Song @ ICLR 2025

@yus167

PhD @mldcmu. Previously @ucsd_cse @UcsdMathDept

ID: 1250678066742874113

Link: https://yudasong.github.io · Joined: 16-04-2020 06:51:08

113 Tweets

359 Followers

260 Following

Zhengyi “Zen” Luo (@zhengyiluo) 's Twitter Profile Photo

🎓 Excited to defend my PhD thesis “Learning Universal Humanoid Control” at CMU this Friday! From scalable motion imitators to visual dexterous whole-body policies — it’s been a wild ride 🤖✨ 📅 April 25, 2025 📍 CMU RI & online 🔗 cs.cmu.edu/calendar/18255…

Runzhe Wu @ICLR2025 (@runzhe_wu) 's Twitter Profile Photo

#ICLR2025 Oral 🚨 Provably efficient RL has advanced significantly but it's still unclear if efficient algos exist for the simple setting of "Linear Bellman Completeness" We solve for the special case of deterministic state transitions using an approach we call "span argument"!🧵

Yutong (Kelly) He (@electronickale) 's Twitter Profile Photo

✨ Love 4o-style image generation but prefer to use Midjourney? Tired of manual prompt crafting from inspo images? PRISM to the rescue! 🖼️→📝→🖼️ We automate black-box prompt engineering—no training, no embeddings, just accurate, readable prompts from your inspo images! 1/🧵

Runtian Zhai (@runtianzhai) 's Twitter Profile Photo

Why can foundation models transfer to so many downstream tasks? Will the scaling law end? Will pretraining end like Ilya Sutskever predicted? My PhD thesis builds the contexture theory to answer the above. Blog: runtianzhai.com/thesis Paper: arxiv.org/abs/2504.19792 🧵1/12

Keegan Harris (@keegan_w_harris) 's Twitter Profile Photo

Back in March, I wore a head-mounted camera for a week straight and fine-tuned ChatGPT on the resulting data. Here's what happened (1/6) arxiv.org/pdf/2504.03857

Aurora (@aurora_inno) 's Twitter Profile Photo

Self-driving freight is here. We’ve launched driverless operations in Texas, marking the first time heavy-duty trucks are hauling commercial freight on public roads with no one behind the wheel. We’re proud to lead this industry-defining milestone – paving the way for safer roads

Rattana Pukdee (@rpukdeee) 's Twitter Profile Photo

In our #AISTATS2025 paper, we ask: when is it possible to recover a consistent joint distribution from conditionals? We propose path consistency and autoregressive path consistency—necessary and easily verifiable conditions. See you at Poster session 3, Monday 5th May.

Dylan Foster 🐢 (@canondetortugas) 's Twitter Profile Photo

Is Best-of-N really the best we can do for language model inference? New algo & paper: 🚨InferenceTimePessimism🚨 Led by the amazing Audrey Huang with Adam Block, Qinghua Liu, Nan Jiang, and Akshay Krishnamurthy. Appearing at ICML '25. 1/11

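For context, a minimal sketch of the standard Best-of-N baseline the tweet is questioning: sample N completions and keep the one a reward model scores highest. This is the generic recipe, not the InferenceTimePessimism algorithm itself; generate and reward_model are hypothetical callables.

# Generic Best-of-N inference baseline (not the paper's algorithm);
# `generate` and `reward_model` are hypothetical callables supplied by the user.
def best_of_n(prompt, generate, reward_model, n=16):
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: reward_model(prompt, y))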
Yifei Zhou (@yifeizhou02) 's Twitter Profile Photo

With my previous research in multimodal models and agents, I believe the only truly useful multimodal agent before 2027 is multimodal co-creation in structured formats. Sharing my first blog post, because I don't quite see this point of view around, but it could be quite impactful for society.

Antoine Moulin (@antoine_mln) 's Twitter Profile Photo

new preprint with the amazing Luca Viano and Gergely Neu on offline imitation learning! when the expert is hard to represent but the environment is simple, estimating a Q-value rather than the expert directly may be beneficial. there are many open questions left though!

Lili (@lchen915) 's Twitter Profile Photo

One fundamental issue with RL – whether it’s for robots or LLMs – is how hard it is to get rewards. For LLM reasoning, we need ground-truth labels to verify answers. We found that maximizing confidence alone allows LLMs to improve their reasoning with RL!
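A minimal sketch of the idea as I read it, assuming "confidence" is the average log-probability the model assigns to the tokens of its own sampled answer; the paper may define the signal differently, and the wiring into the RL loop is omitted here.

# Hypothetical confidence signal: mean log-prob of the sampled answer tokens.
# Maximizing this (instead of a verifier reward) is the label-free objective
# the tweet describes; the paper's exact definition may differ.
import torch
import torch.nn.functional as F

def confidence_reward(logits: torch.Tensor, answer_ids: torch.Tensor) -> torch.Tensor:
    # logits: [T, V] next-token logits over the answer span
    # answer_ids: [T] ids of the sampled answer tokens
    logprobs = F.log_softmax(logits, dim=-1)             # [T, V]
    token_lp = logprobs.gather(-1, answer_ids[:, None])  # [T, 1]
    return token_lp.mean()                               # scalar "confidence" reward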

Andrea Zanette (@zanette_ai) 's Twitter Profile Photo

Can Large Reasoning Models Self-Train? We propose Self-Rewarded Training (SRT)—where LLMs generate their own supervision. Main findings: SRT initially matches RL on ground truth, but sustained training risks reward hacking. We also investigate mitigation strategies.

Fahim Tajwar (@fahimtajwar10) 's Twitter Profile Photo

RL with verifiable reward has shown impressive results in improving LLM reasoning, but what can we do when we do not have ground truth answers? Introducing Self-Rewarding Training (SRT): where language models provide their own reward for RL training! 🧵 1/n

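A hedged sketch of one way a model can "provide its own reward": treat agreement with the majority-vote answer over its own samples as a pseudo-reward. This is an illustration of the self-rewarding idea, not necessarily the exact SRT objective.

# Illustrative self-reward: 1.0 if a sample matches the majority-vote answer
# across the model's own generations, else 0.0 (no ground-truth labels used).
# The actual SRT reward may be defined differently.
from collections import Counter

def self_reward(sampled_answers: list[str]) -> list[float]:
    majority, _ = Counter(sampled_answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in sampled_answers]

print(self_reward(["42", "42", "41", "42"]))  # [1.0, 1.0, 0.0, 1.0]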
Yifei Zhou (@yifeizhou02) 's Twitter Profile Photo

SCA is the first self-improvement RL framework for general multi-turn tool-use agents. It does so by first generating its own verifiers for its own synthetic tasks. Stay tuned for more details!

Nimit Kalra (@qw3rtman) 's Twitter Profile Photo

Still noodling on this, but the generation-verification gap proposed by Yuda Song, Hanlin Zhang, Sham Kakade, Udaya Ghai, et al. in arxiv.org/abs/2412.02674 is a very nice framework that unifies a lot of thoughts around self-improvement/verification/bootstrapping reasoning
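A rough, illustrative proxy for measuring such a gap: compare how accurately the model verifies its own generations against how accurately it generates in the first place. The formula below is an assumption for illustration, not necessarily the paper's definition; generate, verify, and is_correct are hypothetical callables.

# Illustrative proxy: gap = self-verification accuracy on own generations
# minus raw generation accuracy. `generate`, `verify`, and `is_correct`
# are hypothetical callables; the paper's formal definition may differ.
def generation_verification_gap(prompts, generate, verify, is_correct, n=8):
    gen_hits, ver_hits, total = 0, 0, 0
    for p in prompts:
        for y in (generate(p) for _ in range(n)):
            label = is_correct(p, y)                # ground truth, used for evaluation only
            gen_hits += int(label)
            ver_hits += int(verify(p, y) == label)  # did self-verification agree with the truth?
            total += 1
    return ver_hits / total - gen_hits / total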

Gokul Swamy (@g_k_swamy) 's Twitter Profile Photo

Say ahoy to 𝚂𝙰𝙸𝙻𝙾𝚁⛵: a new paradigm of *learning to search* from demonstrations, enabling test-time reasoning about how to recover from mistakes w/o any additional human feedback! 𝚂𝙰𝙸𝙻𝙾𝚁 ⛵ out-performs Diffusion Policies trained via behavioral cloning on 5-10x data!

Yifei Zhou (@yifeizhou02) 's Twitter Profile Photo

In this paper we explore how we can efficiently scale inference-time compute for agents. Instead of blindly scaling the number of tokens at each step, it would be much better to scale the number of interactions! Check out how we did it!

Zhaolin Gao (@gaozhaolin) 's Twitter Profile Photo

Current RLVR methods like GRPO and PPO require explicit critics or multiple generations per prompt, resulting in high computational and memory costs. We introduce ⭐A*-PO, a policy optimization algorithm that uses only a single sample per prompt during online RL, without a critic.

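A heavily hedged sketch of the general pattern being described, not the actual A*-PO update: with a single sample per prompt, the advantage can be formed against a baseline precomputed offline (for example from reference-policy samples) rather than a learned critic or extra generations.

# Single-sample, critic-free policy-gradient surrogate (generic pattern only;
# NOT the actual A*-PO objective). `logp_sum` is the summed token log-probs of
# the one sampled response under the current policy; `baseline` is precomputed
# offline, so no critic network or extra generations per prompt are needed.
import torch

def single_sample_loss(logp_sum: torch.Tensor, reward: float, baseline: float) -> torch.Tensor:
    advantage = reward - baseline   # baseline stands in for a critic
    return -(advantage * logp_sum)  # minimize this to ascend the policy gradient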
Nimit Kalra (@qw3rtman) 's Twitter Profile Photo

Discussing "Mind the Gap" tonight at Haize Labs's NYC AI Reading Group with Leonard Tang and will brown. Authors study self-improvement through the "Generation-Verification Gap" (model's verification ability over its own generations) and find that this capability log scales with

Discussing "Mind the Gap" tonight at <a href="/haizelabs/">Haize Labs</a>'s NYC AI Reading Group with <a href="/leonardtang_/">Leonard Tang</a> and <a href="/willccbb/">will brown</a>. Authors study self-improvement through the "Generation-Verification Gap" (model's verification ability over its own generations) and find that this capability log scales with