Yuda Song @ ICLR 2025 (@yus167) 's Twitter Profile
Yuda Song @ ICLR 2025

@yus167

PhD @mldcmu. Previously @ucsd_cse @UcsdMathDept

ID: 1250678066742874113

Link: https://yudasong.github.io · Joined: 16-04-2020 06:51:08

113 Tweets

359 Followers

260 Following

Zhengyi “Zen” Luo (@zhengyiluo) 's Twitter Profile Photo

🎓 Excited to defend my PhD thesis “Learning Universal Humanoid Control” at CMU this Friday! From scalable motion imitators to visual dexterous whole-body policies — it’s been a wild ride 🤖✨ 📅 April 25, 2025 📍 CMU RI & online 🔗 cs.cmu.edu/calendar/18255…

Runzhe Wu @ICLR2025 (@runzhe_wu) 's Twitter Profile Photo

#ICLR2025 Oral 🚨 Provably efficient RL has advanced significantly but it's still unclear if efficient algos exist for the simple setting of "Linear Bellman Completeness" We solve for the special case of deterministic state transitions using an approach we call "span argument"!🧵

Yutong (Kelly) He (@electronickale) 's Twitter Profile Photo

✨ Love 4o-style image generation but prefer to use Midjourney? Tired of manual prompt crafting from inspo images? PRISM to the rescue! 🖼️→📝→🖼️ We automate black-box prompt engineering—no training, no embeddings, just accurate, readable prompts from your inspo images! 1/🧵

Runtian Zhai (@runtianzhai) 's Twitter Profile Photo

Why can foundation models transfer to so many downstream tasks? Will the scaling law end? Will pretraining end like Ilya Sutskever predicted? My PhD thesis builds the contexture theory to answer the above. Blog: runtianzhai.com/thesis Paper: arxiv.org/abs/2504.19792 🧵1/12

Keegan Harris (@keegan_w_harris) 's Twitter Profile Photo

Back in March, I wore a head-mounted camera for a week straight and fine-tuned ChatGPT on the resulting data. Here's what happened (1/6) arxiv.org/pdf/2504.03857

Aurora (@aurora_inno) 's Twitter Profile Photo

Self-driving freight is here. We’ve launched driverless operations in Texas, marking the first time heavy-duty trucks are hauling commercial freight on public roads with no one behind the wheel. We’re proud to lead this industry-defining milestone – paving the way for safer roads

Rattana Pukdee (@rpukdeee) 's Twitter Profile Photo

In our #AISTATS2025 paper, we ask: when is it possible to recover a consistent joint distribution from conditionals? We propose path consistency and autoregressive path consistency—necessary and easily verifiable conditions. See you at Poster session 3, Monday 5th May.

Dylan Foster 🐢 (@canondetortugas) 's Twitter Profile Photo

Is Best-of-N really the best we can do for language model inference? New algo & paper: 🚨InferenceTimePessimism🚨 Led by the amazing Audrey Huang with Adam Block, Qinghua Liu, Nan Jiang, and Akshay Krishnamurthy. Appearing at ICML '25. 1/11

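For context, a minimal sketch of the standard Best-of-N baseline the tweet is questioning: sample N completions and keep the one a reward model scores highest. This is the generic recipe, not the InferenceTimePessimism algorithm itself; generate and reward_model are hypothetical callables.

# Generic Best-of-N inference baseline (not the paper's algorithm);
# `generate` and `reward_model` are hypothetical callables supplied by the user.
def best_of_n(prompt, generate, reward_model, n=16):
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: reward_model(prompt, y))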
Yifei Zhou (@yifeizhou02) 's Twitter Profile Photo

With my previous research in multimodal models and agents, I believe the only truly useful multimodal agent before 2027 is multimodal co-creation in structured formats. Sharing my first blog post, because I don't quite see this point of view around, but it could be quite impactful for society.

Antoine Moulin (@antoine_mln) 's Twitter Profile Photo

new preprint with the amazing Luca Viano and Gergely Neu on offline imitation learning! when the expert is hard to represent but the environment is simple, estimating a Q-value rather than the expert directly may be beneficial. there are many open questions left though!

Lili (@lchen915) 's Twitter Profile Photo

One fundamental issue with RL – whether it’s for robots or LLMs – is how hard it is to get rewards. For LLM reasoning, we need ground-truth labels to verify answers. We found that maximizing confidence alone allows LLMs to improve their reasoning with RL!
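A minimal sketch of the idea as I read it, assuming "confidence" is the average log-probability the model assigns to the tokens of its own sampled answer; the paper may define the signal differently, and the wiring into the RL loop is omitted here.

# Hypothetical confidence signal: mean log-prob of the sampled answer tokens.
# Maximizing this (instead of a verifier reward) is the label-free objective
# the tweet describes; the paper's exact definition may differ.
import torch
import torch.nn.functional as F

def confidence_reward(logits: torch.Tensor, answer_ids: torch.Tensor) -> torch.Tensor:
    # logits: [T, V] next-token logits over the answer span
    # answer_ids: [T] ids of the sampled answer tokens
    logprobs = F.log_softmax(logits, dim=-1)             # [T, V]
    token_lp = logprobs.gather(-1, answer_ids[:, None])  # [T, 1]
    return token_lp.mean()                               # scalar "confidence" reward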

Andrea Zanette (@zanette_ai) 's Twitter Profile Photo

Can Large Reasoning Models Self-Train? We propose Self-Rewarded Training (SRT)—where LLMs generate their own supervision. Main findings: SRT initially matches RL on ground truth, but sustained training risks reward hacking. We also investigate mitigation strategies.

Fahim Tajwar (@fahimtajwar10) 's Twitter Profile Photo

RL with verifiable reward has shown impressive results in improving LLM reasoning, but what can we do when we do not have ground truth answers? Introducing Self-Rewarding Training (SRT): where language models provide their own reward for RL training! 🧵 1/n

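A hedged sketch of one way a model can "provide its own reward": treat agreement with the majority-vote answer over its own samples as a pseudo-reward. This is an illustration of the self-rewarding idea, not necessarily the exact SRT objective.

# Illustrative self-reward: 1.0 if a sample matches the majority-vote answer
# across the model's own generations, else 0.0 (no ground-truth labels used).
# The actual SRT reward may be defined differently.
from collections import Counter

def self_reward(sampled_answers: list[str]) -> list[float]:
    majority, _ = Counter(sampled_answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in sampled_answers]

print(self_reward(["42", "42", "41", "42"]))  # [1.0, 1.0, 0.0, 1.0]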
Yifei Zhou (@yifeizhou02) 's Twitter Profile Photo

SCA is the first self-improvement RL framework for general multi-turn tool-use agents. It does so by first generating its own verifiers for its own synthetic tasks. Stay tuned for more details!

Nimit Kalra (@qw3rtman) 's Twitter Profile Photo

Still noodling on this, but the generation-verification gap proposed by Yuda Song, Hanlin Zhang, Sham Kakade, Udaya Ghai, et al. in arxiv.org/abs/2412.02674 is a very nice framework that unifies a lot of thoughts around self-improvement/verification/bootstrapping reasoning
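A rough, illustrative proxy for measuring such a gap: compare how accurately the model verifies its own generations against how accurately it generates in the first place. The formula below is an assumption for illustration, not necessarily the paper's definition; generate, verify, and is_correct are hypothetical callables.

# Illustrative proxy: gap = self-verification accuracy on own generations
# minus raw generation accuracy. `generate`, `verify`, and `is_correct`
# are hypothetical callables; the paper's formal definition may differ.
def generation_verification_gap(prompts, generate, verify, is_correct, n=8):
    gen_hits, ver_hits, total = 0, 0, 0
    for p in prompts:
        for y in (generate(p) for _ in range(n)):
            label = is_correct(p, y)                # ground truth, used for evaluation only
            gen_hits += int(label)
            ver_hits += int(verify(p, y) == label)  # did self-verification agree with the truth?
            total += 1
    return ver_hits / total - gen_hits / total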

Gokul Swamy (@g_k_swamy) 's Twitter Profile Photo

Say ahoy to 𝚂𝙰𝙸𝙻𝙾𝚁⛵: a new paradigm of *learning to search* from demonstrations, enabling test-time reasoning about how to recover from mistakes w/o any additional human feedback! 𝚂𝙰𝙸𝙻𝙾𝚁 ⛵ out-performs Diffusion Policies trained via behavioral cloning on 5-10x data!

Yifei Zhou (@yifeizhou02) 's Twitter Profile Photo

In this paper we explore how we can efficiently scale inference-time compute for agents. Instead of blindly scaling the number of tokens at each step, it would be much better to scale the number of interactions! Check out how we did it!

Zhaolin Gao (@gaozhaolin) 's Twitter Profile Photo

Current RLVR methods like GRPO and PPO require explicit critics or multiple generations per prompt, resulting in high computational and memory costs. We introduce ⭐A*-PO, a policy optimization algorithm that uses only a single sample per prompt during online RL, without a critic.

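A heavily hedged sketch of the general pattern being described, not the actual A*-PO update: with a single sample per prompt, the advantage can be formed against a baseline precomputed offline (for example from reference-policy samples) rather than a learned critic or extra generations.

# Single-sample, critic-free policy-gradient surrogate (generic pattern only;
# NOT the actual A*-PO objective). `logp_sum` is the summed token log-probs of
# the one sampled response under the current policy; `baseline` is precomputed
# offline, so no critic network or extra generations per prompt are needed.
import torch

def single_sample_loss(logp_sum: torch.Tensor, reward: float, baseline: float) -> torch.Tensor:
    advantage = reward - baseline   # baseline stands in for a critic
    return -(advantage * logp_sum)  # minimize this to ascend the policy gradient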
Nimit Kalra (@qw3rtman) 's Twitter Profile Photo

Discussing "Mind the Gap" tonight at Haize Labs's NYC AI Reading Group with Leonard Tang and will brown. Authors study self-improvement through the "Generation-Verification Gap" (model's verification ability over its own generations) and find that this capability log scales with

Discussing "Mind the Gap" tonight at <a href="/haizelabs/">Haize Labs</a>'s NYC AI Reading Group with <a href="/leonardtang_/">Leonard Tang</a> and <a href="/willccbb/">will brown</a>. Authors study self-improvement through the "Generation-Verification Gap" (model's verification ability over its own generations) and find that this capability log scales with