Nino Vieillard (@nino_vieillard) 's Twitter Profile
Nino Vieillard

@nino_vieillard

Research Scientist @Google

ID: 1400768358530928641

Joined: 04-06-2021 10:59:05

37 Tweets

222 Followers

159 Following

AK (@_akhaliq) 's Twitter Profile Photo

MusicLM: Generating Music From Text abs: arxiv.org/abs/2301.11325 project page: google-research.github.io/seanet/musiclm…

Toshinori Kitamura (@t_kitamura14) 's Twitter Profile Photo

Our recent work "Regularization and Variance-Weighted Regression Achieves Minimax Optimality in Linear MDPs: Theory and Practice" got accepted to ICML 2023 and is now on arXiv! Code: github.com/matsuolab/Vari… arXiv: arxiv.org/abs/2305.13185

Johan Ferret (@johanferret) 's Twitter Profile Photo

Our #ACL2023 paper "Factually Consistent Summarization via Reinforcement Learning with Textual Entailment Feedback" is now on arXiv! tl;dr - we improve the factuality of summaries via RL, without human feedback! 📜 arxiv.org/abs/2306.00186 Thread (1/10) 👇
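
A minimal Python sketch of the idea above, turning textual entailment into an RL reward so no human feedback is needed; nli_entailment_prob is an assumed NLI helper, and per-sentence averaging is one plausible scoring choice rather than necessarily the paper's:

# Reward a summary by how well the source document entails it, so an
# unsupported sentence pulls the whole reward down. nli_entailment_prob is an
# assumed interface returning P(entailment) for a premise/hypothesis pair.
def entailment_reward(nli_entailment_prob, document: str, summary: str) -> float:
    sentences = [s.strip() for s in summary.split(".") if s.strip()]
    scores = [nli_entailment_prob(premise=document, hypothesis=s) for s in sentences]
    return sum(scores) / max(len(scores), 1)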

Rishabh Agarwal (@agarwl_) 's Twitter Profile Photo

[1/3] Here's a simple but powerful idea to improve RLHF / RLAIF by combining it with knowledge distillation: Simply regularize the LLM policy (student) to a more capable teacher model instead of the base student model for the KL regularization term.
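A rough PyTorch sketch of the swap described above: keep the usual RLHF KL penalty, but anchor it to a stronger teacher's logits rather than the frozen base copy of the student. Names and shapes are illustrative, and the surrounding RLHF loss is only hinted at in the final comment:

import torch
import torch.nn.functional as F

def kl_to_teacher(student_logits, teacher_logits):
    # Per-token KL(student || teacher) over the vocabulary, averaged over
    # batch and sequence positions; logits have shape [batch, seq, vocab].
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    return (student_logp.exp() * (student_logp - teacher_logp)).sum(-1).mean()

# RLHF objective sketch: loss = -expected_reward + kl_coef * kl_to_teacher(student_logits, teacher_logits)
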
Rishabh Agarwal (@agarwl_) 's Twitter Profile Photo

I'm giving a talk about knowledge distillation in a few hours (1 pm ET) at DLCT reading group ML Collective. If you are interested in attending, please see instructions here: mlcollective.org/dlct/. Thanks to Rosanne Liu for organizing.

Rishabh Agarwal (@agarwl_) 's Twitter Profile Photo

On-policy distillation of LLMs got accepted at ICLR. Also, 2 ICLR papers already use GKD to improve speculative decoding!

Idea: Sample self-generated output sequences from student, run inference on teacher to get logits, and minimize mismatch b/w student and teacher logits.
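
A hedged PyTorch sketch of that recipe (not the paper's code): student.generate and the bare model calls returning logits are assumed Hugging-Face-style interfaces, and forward KL is shown as just one choice of mismatch measure:

import torch
import torch.nn.functional as F

def gkd_step(student, teacher, prompts, optimizer, max_new_tokens=64):
    # 1) Sample self-generated output sequences from the student (on-policy data).
    with torch.no_grad():
        sequences = student.generate(prompts, max_new_tokens=max_new_tokens)
    # 2) Run inference on the teacher to get its logits for those same tokens.
    with torch.no_grad():
        teacher_logits = teacher(sequences)
    # 3) Minimize the mismatch between student and teacher token distributions.
    student_logits = student(sequences)
    loss = F.kl_div(F.log_softmax(student_logits, dim=-1),
                    F.log_softmax(teacher_logits, dim=-1),
                    log_target=True, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
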
AK (@_akhaliq) 's Twitter Profile Photo

Google DeepMind presents WARM

On the Benefits of Weight Averaged Reward Models

paper page: huggingface.co/papers/2401.12…

Aligning large language models (LLMs) with human preferences through reinforcement learning (RLHF) can lead to reward hacking, where LLMs exploit failures in the
Alexandre Ramé (@ramealexandre) 's Twitter Profile Photo

Introducing DeepMind's Weight Averaged Reward Model (WARM) for alignment via RLHF! We merge multiple reward models into one that's more reliable and robust. WARM efficiently captures the best of each to mitigate reward hacking. A thread 🧵 below. arxiv.org/abs/2401.12187
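
A rough sketch of the merging step as described in the tweet, assuming plain PyTorch state dicts; how the individual reward models are trained and selected is out of scope here:

import torch

def average_state_dicts(state_dicts):
    # Element-wise average of the weights of several fine-tuned reward models.
    keys = state_dicts[0].keys()
    return {k: torch.stack([sd[k].float() for sd in state_dicts]).mean(dim=0)
            for k in keys}

# Usage sketch: merged = average_state_dicts([rm.state_dict() for rm in reward_models]),
# then load it into a fresh reward model with load_state_dict(merged).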

AK (@_akhaliq) 's Twitter Profile Photo

Google presents MusicRL

Aligning Music Generation to Human Preferences

paper page: huggingface.co/papers/2402.04…

propose MusicRL, the first music generation system finetuned from human feedback. Appreciation of text-to-music models is particularly subjective since the concept of
Alexandre Ramé (@ramealexandre) 's Twitter Profile Photo

Introducing Weight Averaged Rewarded Policies (WARP), Google DeepMind's latest RLHF alignment method using the magic of model merging. By scaling alignment the way pre-training was scaled, WARP trains a state-of-the-art Gemma LLM that surpasses previous releases. A 🧵 below. arxiv.org/abs/2406.16768

Robert Dadashi (@robdadashi) 's Twitter Profile Photo

I am so proud to announce that:
- Gemma 2 27B IT tops all open weights models on Chatbot Arena, with a pinch of optimism in the face of uncertainty :)
- Gemma 2 9B IT sets a new frontier for models of similar size.
1/n
Pier Giuseppe Sessa (@piergsessa) 's Twitter Profile Photo

We are delighted to introduce🕴️J-BOND🕴️, a novel RLHF algorithm to align LLMs via Best-of-N Distillation: arxiv.org/abs/2407.14622
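
For context, a small Python sketch of plain Best-of-N sampling, the behavior that, per the title, J-BOND distills into the policy so the N-fold sampling cost is avoided at inference time; policy.generate and reward_model.score are assumed interfaces:

import torch

def best_of_n(policy, reward_model, prompt, n=8):
    # Draw N candidate completions and keep the one the reward model scores highest.
    candidates = [policy.generate(prompt) for _ in range(n)]
    scores = torch.tensor([reward_model.score(prompt, c) for c in candidates])
    return candidates[int(scores.argmax())]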

Robert Dadashi (@robdadashi) 's Twitter Profile Photo

Gemma 2 2B is here! 

Fantastic performance for size, it's great for research and applications.

I am very proud of the progress our team made over the last few months!
Daniil Tiapkin (@dtiapkin) 's Twitter Profile Photo

1/ If you’re familiar with RLHF, you’ve likely heard of reward hacking, where over-optimizing an imperfect reward model leads to unintended behaviors. But what about teacher hacking in knowledge distillation: can the teacher be hacked, like rewards in RLHF?

Alexandre Ramé (@ramealexandre) 's Twitter Profile Photo

Modern post-training is essentially distillation then RL. While reward hacking is well-known and feared, could there be such a thing as teacher hacking? Our latest paper confirms it. Fortunately, we also show how to mitigate it! The secret: diversity, as always ^^

Olivier Bachem (@olivierbachem) 's Twitter Profile Photo

Really excited that we can finally share Gemma 3 with the world. The whole team put a lot of hard work into this and the results speak for themselves: being able to fit a top-10 LMSys model on a single accelerator will enable so many people to benefit from strong models.

Robert Dadashi (@robdadashi) 's Twitter Profile Photo

Today we are releasing the best open-weights model you can run on a single device, reaching 1339 Elo on LMSys for Gemma 3 27B (aka zizou-10)! Very strong capabilities in math, multilinguality, coding, instruction following, and function calling!

Johan Ferret (@johanferret) 's Twitter Profile Photo

Glad to announce Gemma 3, our newest addition to the Gemma family of open models! 
Gemma 3 models are long context (128k), multimodal (image as input for 4B+) and significantly better at math, coding, reasoning and multilinguality.
Post-trained with love from 🇫🇷 🇨🇭 🇳🇱