Quentin Gallouédec (@qgallouedec)'s Twitter Profile
Quentin Gallouédec

@qgallouedec

PhD - Research engineer @huggingface 🤗
TRL maintainer
📦➡️🦋 bsky.app/profile/qgallo…

ID: 1127913981526540288

Joined: 13-05-2019 12:30:23

495 Tweets

2.2K Followers

553 Following

Quentin Gallouédec (@qgallouedec)

You shouldn't do RL on small models. Distilling from large models works better. And you can now do it even when tokenizers don't match.
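
A minimal sketch of what this looks like in practice, assuming TRL's GKDTrainer (generalized knowledge distillation) and the simple case where student and teacher share a tokenizer; the model and dataset names below are placeholders, and the cross-tokenizer case mentioned in the tweet may go through a different trainer or extra options:

```python
# Hedged sketch: distill a small student from a larger teacher with TRL's GKDTrainer.
# Model and dataset names are placeholder choices, not recommendations.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import GKDConfig, GKDTrainer

student_id = "Qwen/Qwen2.5-0.5B-Instruct"  # small student (placeholder)
teacher_id = "Qwen/Qwen2.5-7B-Instruct"    # larger teacher (placeholder)

tokenizer = AutoTokenizer.from_pretrained(student_id)
student = AutoModelForCausalLM.from_pretrained(student_id)
teacher = AutoModelForCausalLM.from_pretrained(teacher_id)

# Any chat dataset with a "messages" column should work here (placeholder name).
dataset = load_dataset("your-username/chat-prompts", split="train")

args = GKDConfig(
    output_dir="student-distilled",
    lmbda=0.5,  # fraction of on-policy (student-generated) completions
    beta=0.5,   # interpolates between forward and reverse KL
)

trainer = GKDTrainer(
    model=student,
    teacher_model=teacher,
    args=args,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```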

Quentin Gallouédec (@qgallouedec)

Questions! 🧐 LayerNorm always upcasts inputs to fp32 for stability (hardcoded). But the final multiplication by the weights is done in the original dtype. 1. Why? 2. Sometimes this multiplication is done in fp32 instead. When and why?
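
For reference, a sketch of the pattern being asked about, written in the style of the normalization forwards found in many transformers model files (not copied from any specific implementation; bias omitted for brevity):

```python
# Hedged sketch of the two dtype-handling variants the tweet contrasts.
import torch
import torch.nn as nn

class Norm(nn.Module):
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        input_dtype = x.dtype
        x = x.to(torch.float32)  # hardcoded upcast for numerical stability
        mean = x.mean(-1, keepdim=True)
        var = (x - mean).pow(2).mean(-1, keepdim=True)
        x = (x - mean) * torch.rsqrt(var + self.eps)
        # Variant 1 (what the tweet describes): cast back first, then multiply
        # by the weight in the original dtype.
        out = self.weight * x.to(input_dtype)
        # Variant 2 (seen in some implementations): multiply in fp32, then cast back.
        # out = (self.weight.to(torch.float32) * x).to(input_dtype)
        return out
```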

Lewis Tunstall (@_lewtun)

In the Smol Training Playbook, I tried to survey the state of popular post-training frameworks.

Let me know if I missed any and I'll add them to the list!

steven (@tu7uruu)

Here is a tutorial on training LLaSA (LLaMA-based TTS) using GRPO to improve prosody, rhythm, and expressiveness in synthesized speech with TRL!
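
A rough sketch of the training loop such a tutorial would build on, assuming TRL's GRPOTrainer with a custom reward function. `prosody_score` is a hypothetical stand-in for whatever audio-quality metric the tutorial uses, and the model/dataset names are placeholders rather than the tutorial's exact choices:

```python
# Hedged sketch: GRPO fine-tuning with TRL, rewarding prosody of synthesized speech.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def prosody_score(completion) -> float:
    """Hypothetical placeholder: decode speech tokens to audio and rate prosody."""
    raise NotImplementedError("plug in your own prosody/expressiveness evaluator")

def prosody_reward(completions, **kwargs):
    # One scalar reward per sampled completion.
    return [prosody_score(completion) for completion in completions]

# Dataset with a "prompt" column containing the text to synthesize (placeholder name).
dataset = load_dataset("your-username/tts-prompts", split="train")

trainer = GRPOTrainer(
    model="your-username/llasa-style-tts-model",  # placeholder for a LLaMA-based TTS model
    reward_funcs=prosody_reward,
    args=GRPOConfig(output_dir="llasa-grpo", num_generations=8),
    train_dataset=dataset,
)
trainer.train()
```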

Muyu He (@hemuyu0327)

On-policy distillation is powerful, but Thinking Machines's tinker only supports distilling from a teacher model within the same family, making it impossible for Qwen to learn from DeepSeek, gpt-oss, etc.

For the first time, we enabled model-agnostic distillations natively using …
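
To make the core idea concrete, here is a sketch of one on-policy distillation step for the simple case where student and teacher share a tokenizer. The model-agnostic, cross-family case this tweet announces additionally requires aligning mismatched tokenizations, which the sketch deliberately leaves out (prompt and padding masking are also omitted for brevity):

```python
# Hedged sketch: on-policy distillation = sample from the student, score with the teacher,
# minimize the per-token reverse KL(student || teacher) on the student's own samples.
import torch
import torch.nn.functional as F

def on_policy_distill_step(student, teacher, tokenizer, prompts, max_new_tokens=128):
    # 1) Sample completions from the *student* (that's what makes it on-policy).
    inputs = tokenizer(prompts, return_tensors="pt", padding=True)
    with torch.no_grad():
        sequences = student.generate(**inputs, do_sample=True, max_new_tokens=max_new_tokens)

    # 2) Score the same token sequences with both models.
    student_logits = student(sequences).logits[:, :-1]
    with torch.no_grad():
        teacher_logits = teacher(sequences).logits[:, :-1]

    # 3) Per-token reverse KL(student || teacher), averaged over positions.
    student_logprobs = F.log_softmax(student_logits, dim=-1)
    teacher_logprobs = F.log_softmax(teacher_logits, dim=-1)
    kl = (student_logprobs.exp() * (student_logprobs - teacher_logprobs)).sum(-1)
    return kl.mean()
```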

Benny (Yufei) Chen (@the_bunny_chen)

Reinforcement Learning for agents has been held back by a lack of standard infrastructure. Production agents don't live in clean "gyms"—they live in messy, async environments.

Today we're open-sourcing Eval Protocol: a framework to run RL directly on your production agents. Day …