Qiying Yu (@qiying_yu)'s Twitter Profile
Qiying Yu

@qiying_yu

PhD student at Tsinghua AIR @Tsinghua_Uni @AIRTHU1201

ID: 1627292036868214784

Website: https://yqy2001.github.io · Joined: 19-02-2023 13:00:54

80 Tweets

529 Followers

705 Following

Haibin (@eric_haibin_lin)'s Twitter Profile Photo

Qiying Yu and team just dropped the DAPO algorithm (decoupled clip and dynamic sampling policy optimization)! DAPO-Zero-32B, a fully open-source RL reasoning model, surpasses DeepSeek-R1-Zero-Qwen-32B, and scores 50 on AIME 2024 with 50% fewer steps. It is trained with

<a href="/qiying_yu/">Qiying Yu</a> and team just dropped the DAPO algorithm (decoupled clip and dynamic sampling policy optimization)! DAPO-Zero-32B, a fully open-source RL reasoning model, surpasses DeepSeek-R1-Zero-Qwen-32B, and scores 50 on AIME 2024 with 50% fewer steps. It is trained with
Marktechpost AI Research News ⚡ (@marktechpost)'s Twitter Profile Photo

ByteDance Research Releases DAPO: A Fully Open-Sourced LLM Reinforcement Learning System at Scale

Researchers from ByteDance, Tsinghua University, and the University of Hong Kong recently introduced DAPO (Dynamic Sampling Policy Optimization), an open-source large-scale
𝚐𝔪𝟾𝚡𝚡𝟾 (@gm8xx8)'s Twitter Profile Photo

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

DAPO is a reinforcement learning algorithm for large-scale LLM training, achieving 50 points on AIME 2024 with Qwen2.5-32B. It introduces four key techniques to improve LLM reasoning and provides open-source
Kyle Corbitt (@corbtt)'s Twitter Profile Photo

Lots of good nuggets here. Interestingly, they completely drop the KL divergence penalty and get good results. This mirrors what we're finding in our own experiments. Seems not to be so necessary for RLVR with GRPO. As a bonus, skipping it speeds up training significantly!
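
For readers who want the mechanics: below is a minimal sketch of a GRPO-style clipped token loss with the KL penalty simply omitted, as the tweet above describes. This is not the authors' code; the tensor shapes, the `clip_eps` default, and the helper name are assumptions for illustration.

```python
import torch

def clipped_policy_loss_no_kl(logprobs, old_logprobs, advantages, mask, clip_eps=0.2):
    """PPO/GRPO-style clipped surrogate loss with the KL-to-reference penalty dropped.

    logprobs, old_logprobs : (batch, seq) token log-probs under current / rollout policy
    advantages             : (batch, 1) group-normalized advantages (GRPO-style)
    mask                   : (batch, seq) 1 for response tokens, 0 for padding
    """
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    per_token = -torch.minimum(unclipped, clipped)
    # Note: no `+ beta * KL(policy || reference)` term here -- with verifiable
    # rewards (RLVR) the reference-model penalty is dropped entirely.
    return (per_token * mask).sum() / mask.sum()
```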

Philipp Schmid (@_philschmid)'s Twitter Profile Photo

New RL Method that's better than GRPO! 🤯 ByteDance Open Source released a new open-source RL method that outperforms GRPO. DAPO, or Decoupled Clip and Dynamic sAmpling Policy Optimization, achieves 50 points on the AIME 2024 benchmark with 50% fewer training steps.

TL;DR: 
🏆 50%
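
On the "Dynamic Sampling" part of the name: the DAPO report describes oversampling and filtering out prompts whose sampled group is uniformly correct or uniformly wrong, since those give zero group-normalized advantage. Below is a rough sketch under that reading; `sample_group` and the 0/1 verifier rewards are hypothetical stand-ins, not any released API.

```python
def dynamic_sampling(prompts, sample_group, batch_size):
    """Keep only prompts whose sampled responses have mixed outcomes.

    `sample_group(prompt)` is assumed to return one 0/1 verifier reward per
    sampled response. All-correct or all-wrong groups carry no learning signal
    under group-normalized advantages, so they are skipped and more prompts
    are consumed until the batch is full.
    """
    kept = []
    for prompt in prompts:
        rewards = sample_group(prompt)
        if 0 < sum(rewards) < len(rewards):  # keep mixed-outcome groups only
            kept.append((prompt, rewards))
        if len(kept) == batch_size:
            break
    return kept
```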
PapersAnon (@papers_anon)'s Twitter Profile Photo

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

From a joint ByteDance/Tsinghua team. Proposes the Decoupled Clip and Dynamic sAmpling Policy Optimization algorithm and fully open-sources a SOTA large-scale RL system. Both were used to achieve 50 points on AIME
elvis (@omarsar0)'s Twitter Profile Photo

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

It introduces DAPO, a fully open-source, large-scale RL system that boosts the chain-of-thought reasoning capabilities of LLMs.

DAPO raises the upper clipping threshold (“Clip-Higher”) in PPO-style training,
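
In code terms, "Clip-Higher" decouples the two clipping bounds so the upper bound can sit above the usual symmetric value, letting low-probability tokens gain mass. Below is a hedged sketch; the epsilon values are illustrative defaults, not a claim about the paper's exact settings.

```python
import torch

def clip_higher_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Asymmetric ('decoupled') PPO clipping: the upper bound 1 + eps_high is
    raised above the usual symmetric value while the lower bound stays at
    1 - eps_low, so unlikely-but-useful tokens are clipped less aggressively."""
    clipped_ratio = torch.clamp(ratio, 1 - eps_low, 1 + eps_high)
    return -torch.minimum(ratio * advantage, clipped_ratio * advantage)
```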
Qiying Yu (@qiying_yu)'s Twitter Profile Photo

Thank you AK for featuring our work. Excited to share valuable insights and open-source useful systems to the community! 🌟

TuringPost (@theturingpost)'s Twitter Profile Photo

A new RL algorithm!

DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization) from ByteDance Open Source is a fully open-source RL system that improves training for long Chain-of-Thought (CoT) reasoning.

It achieves 50 points on AIME 2024, surpassing DeepSeek-R1-Zero, using only
AK (@_akhaliq)'s Twitter Profile Photo

China's ByteDance presents VAPO

Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks

Presents VAPO, the Value-based Augmented Proximal Policy Optimization framework for reasoning models, a novel framework tailored for reasoning models within the value-based
Haibin (@eric_haibin_lin)'s Twitter Profile Photo

🚀 Introducing VAPO (Value-based augmented PPO), our latest RL method for reasoning models. Trained from Qwen-32B-base model, VAPO achieves 60.4 on AIME 2024, outperforming DeepSeek-zero-32B and DAPO-32B📈. 

Built with the verl project, and yes, we will open source it soon.

Key
Quentin Gallouédec (@qgallouedec)'s Twitter Profile Photo

Overlong filtering has been shown to significantly stabilize learning and improve performance. You can now use it in TRL!

It simply consists of masking the loss of truncated samples.

Principle proposed by Qiying Yu in DAPO, implemented by Shirin Yamani 👏
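
The gist in code: zero out the loss mask for any completion that never emitted the end-of-sequence token, i.e. one that was cut off by the generation length limit. A minimal sketch with hypothetical tensor names; this is not TRL's actual implementation.

```python
import torch

def overlong_filtering(loss_mask, completion_ids, eos_token_id):
    """Mask out the loss of truncated (overlong) completions.

    A completion with no EOS token was cut off by the length limit; instead of
    penalizing its (possibly sound) partial reasoning, its tokens are simply
    excluded from the loss.
    """
    finished = (completion_ids == eos_token_id).any(dim=-1, keepdim=True)  # (batch, 1)
    return loss_mask * finished.to(loss_mask.dtype)
```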
Qiying Yu (@qiying_yu)'s Twitter Profile Photo

#ICLR2025
I am going to present VAPO & DAPO twice at ICLR, two SOTA LLM RL algorithms.

1. The 1-2 pm verl Expo Talk, Apr 26, Peridot 202-203
2. The 3:00-3:30 pm break, Apr 24, at the ByteDance Booth

Welcome and see you there!