Nino Vieillard (@nino_vieillard) 's Twitter Profile
Nino Vieillard

@nino_vieillard

Research Scientist @Google

ID: 1400768358530928641

Joined: 04-06-2021 10:59:05

37 Tweets

222 Followers

159 Following

AK (@_akhaliq) 's Twitter Profile Photo

MusicLM: Generating Music From Text abs: arxiv.org/abs/2301.11325 project page: google-research.github.io/seanet/musiclm…

Toshinori Kitamura (@t_kitamura14) 's Twitter Profile Photo

Our recent work "Regularization and Variance-Weighted Regression Achieves Minimax Optimality in Linear MDPs: Theory and Practice" got accepted to ICML 2023 and is now on arXiv! Code: github.com/matsuolab/Vari… arXiv: arxiv.org/abs/2305.13185

Johan Ferret (@johanferret) 's Twitter Profile Photo

Our #ACL2023 paper "Factually Consistent Summarization via Reinforcement Learning with Textual Entailment Feedback" is now on arXiv! tl;dr - we improve the factuality of summaries via RL, without human feedback! 📜 arxiv.org/abs/2306.00186 Thread (1/10) 👇
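
A minimal Python sketch of the idea above, turning textual entailment into an RL reward so no human feedback is needed; nli_entailment_prob is an assumed NLI helper, and per-sentence averaging is one plausible scoring choice rather than necessarily the paper's:

# Reward a summary by how well the source document entails it, so an
# unsupported sentence pulls the whole reward down. nli_entailment_prob is an
# assumed interface returning P(entailment) for a premise/hypothesis pair.
def entailment_reward(nli_entailment_prob, document: str, summary: str) -> float:
    sentences = [s.strip() for s in summary.split(".") if s.strip()]
    scores = [nli_entailment_prob(premise=document, hypothesis=s) for s in sentences]
    return sum(scores) / max(len(scores), 1)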

Rishabh Agarwal (@agarwl_) 's Twitter Profile Photo

[1/3] Here's a simple but powerful idea to improve RLHF / RLAIF by combining it with knowledge distillation: Simply regularize the LLM policy (student) to a more capable teacher model instead of the base student model for the KL regularization term.
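A rough PyTorch sketch of the swap described above: keep the usual RLHF KL penalty, but anchor it to a stronger teacher's logits rather than the frozen base copy of the student. Names and shapes are illustrative, and the surrounding RLHF loss is only hinted at in the final comment:

import torch
import torch.nn.functional as F

def kl_to_teacher(student_logits, teacher_logits):
    # Per-token KL(student || teacher) over the vocabulary, averaged over
    # batch and sequence positions; logits have shape [batch, seq, vocab].
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    return (student_logp.exp() * (student_logp - teacher_logp)).sum(-1).mean()

# RLHF objective sketch: loss = -expected_reward + kl_coef * kl_to_teacher(student_logits, teacher_logits)
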
Rishabh Agarwal (@agarwl_) 's Twitter Profile Photo

I'm giving a talk about knowledge distillation in a few hours (1 pm ET) at DLCT reading group ML Collective. If you are interested in attending, please see instructions here: mlcollective.org/dlct/. Thanks to Rosanne Liu for organizing.

Rishabh Agarwal (@agarwl_) 's Twitter Profile Photo

On-policy distillation of LLMs got accepted at ICLR. Also, 2 ICLR papers already use GKD to improve speculative decoding!

Idea: Sample self-generated output sequences from student, run inference on teacher to get logits, and minimize mismatch b/w student and teacher logits.
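
A hedged PyTorch sketch of that recipe (not the paper's code): student.generate and the bare model calls returning logits are assumed Hugging-Face-style interfaces, and forward KL is shown as just one choice of mismatch measure:

import torch
import torch.nn.functional as F

def gkd_step(student, teacher, prompts, optimizer, max_new_tokens=64):
    # 1) Sample self-generated output sequences from the student (on-policy data).
    with torch.no_grad():
        sequences = student.generate(prompts, max_new_tokens=max_new_tokens)
    # 2) Run inference on the teacher to get its logits for those same tokens.
    with torch.no_grad():
        teacher_logits = teacher(sequences)
    # 3) Minimize the mismatch between student and teacher token distributions.
    student_logits = student(sequences)
    loss = F.kl_div(F.log_softmax(student_logits, dim=-1),
                    F.log_softmax(teacher_logits, dim=-1),
                    log_target=True, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
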
AK (@_akhaliq) 's Twitter Profile Photo

Google DeepMind presents WARM

On the Benefits of Weight Averaged Reward Models

paper page: huggingface.co/papers/2401.12…

Aligning large language models (LLMs) with human preferences through reinforcement learning (RLHF) can lead to reward hacking, where LLMs exploit failures in the
Alexandre Ramé (@ramealexandre) 's Twitter Profile Photo

Introducing DeepMind's Weight Averaged Reward Model (WARM) for alignment via RLHF! We merge multiple reward models into one that's more reliable and robust. WARM efficiently captures the best of each to mitigate reward hacking. A thread 🧵 below. arxiv.org/abs/2401.12187
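
A rough sketch of the merging step as described in the tweet, assuming plain PyTorch state dicts; how the individual reward models are trained and selected is out of scope here:

import torch

def average_state_dicts(state_dicts):
    # Element-wise average of the weights of several fine-tuned reward models.
    keys = state_dicts[0].keys()
    return {k: torch.stack([sd[k].float() for sd in state_dicts]).mean(dim=0)
            for k in keys}

# Usage sketch: merged = average_state_dicts([rm.state_dict() for rm in reward_models]),
# then load it into a fresh reward model with load_state_dict(merged).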

AK (@_akhaliq) 's Twitter Profile Photo

Google presents MusicRL

Aligning Music Generation to Human Preferences

paper page: huggingface.co/papers/2402.04…

propose MusicRL, the first music generation system finetuned from human feedback. Appreciation of text-to-music models is particularly subjective since the concept of
Alexandre Ramé (@ramealexandre) 's Twitter Profile Photo

Introducing Weight Averaged Rewarded Policies (WARP), Google DeepMind's latest RLHF alignment method using the magic of model merging. By scaling alignment the way pre-training was scaled, WARP trains a state-of-the-art Gemma LLM that surpasses previous releases. A 🧵 below. arxiv.org/abs/2406.16768

Robert Dadashi (@robdadashi) 's Twitter Profile Photo

I am so proud to announce that:
- Gemma 2 27B IT tops all open weights models on Chatbot Arena, with a pinch of optimism in the face of uncertainty :)
- Gemma 2 9B IT sets a new frontier for models of similar size.
1/n
Pier Giuseppe Sessa (@piergsessa) 's Twitter Profile Photo

We are delighted to introduce🕴️J-BOND🕴️, a novel RLHF algorithm to align LLMs via Best-of-N Distillation: arxiv.org/abs/2407.14622
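
For context, a small Python sketch of plain Best-of-N sampling, the behavior that, per the title, J-BOND distills into the policy so the N-fold sampling cost is avoided at inference time; policy.generate and reward_model.score are assumed interfaces:

import torch

def best_of_n(policy, reward_model, prompt, n=8):
    # Draw N candidate completions and keep the one the reward model scores highest.
    candidates = [policy.generate(prompt) for _ in range(n)]
    scores = torch.tensor([reward_model.score(prompt, c) for c in candidates])
    return candidates[int(scores.argmax())]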

Robert Dadashi (@robdadashi) 's Twitter Profile Photo

Gemma 2 2B is here! 

Fantastic performance for size, it's great for research and applications.

I am very proud of the progress our team made over the last few months!
Daniil Tiapkin (@dtiapkin) 's Twitter Profile Photo

1/ If you’re familiar with RLHF, you’ve likely heard of reward hacking, where over-optimizing an imperfect reward model leads to unintended behaviors. But what about teacher hacking in knowledge distillation: can the teacher be hacked, like rewards in RLHF?

Alexandre Ramé (@ramealexandre) 's Twitter Profile Photo

Modern post-training is essentially distillation then RL. While reward hacking is well-known and feared, could there be such a thing as teacher hacking? Our latest paper confirms it. Fortunately, we also show how to mitigate it! The secret: diversity, as always ^^

Olivier Bachem (@olivierbachem) 's Twitter Profile Photo

Really excited that we can finally share Gemma 3 with the world. The whole team put a lot of hard work into this and the results speak for themselves: being able to fit a top-10 LMSys model on a single accelerator will enable so many people to benefit from strong models.

Robert Dadashi (@robdadashi) 's Twitter Profile Photo

Today we are releasing the best open-weights model you can run on a single device, reaching 1339 Elo on LMSys for Gemma 3 27B (aka zizou-10)! Very strong capabilities in math, multilinguality, coding, instruction following, and function calling!

Johan Ferret (@johanferret) 's Twitter Profile Photo

Glad to announce Gemma 3, our newest addition to the Gemma family of open models! 
Gemma 3 models are long context (128k), multimodal (image as input for 4B+) and significantly better at math, coding, reasoning and multilinguality.
Post-trained with love from 🇫🇷 🇨🇭 🇳🇱