Nikhil Vyas (@vyasnikhil96)'s Twitter Profile
Nikhil Vyas

@vyasnikhil96

@OpenAI Prev: Postdoc at Harvard, PhD @MITEECS.

ID: 3316896696

Website: https://nikhilvyas.github.io/ · Joined: 16-08-2015 15:09:23

160 Tweets

752 Followers

609 Following

Sham Kakade (@shamkakade6)

Which optimizer is opt? Our new work compares SGD, Adam, Adafactor (+ momentum), Lion, and, simply, SignSGD on LLM training wrt performance _and_ hyperparameter stability. tldr: Use anything but SGD, the rest are nearly identical:
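
For context on the simplest optimizer in that comparison, here is a minimal sketch of a SignSGD/Signum-style step (my own PyTorch illustration, not code from the paper; the exact variant and hyperparameters used in the study may differ):

```python
import torch

def signsgd_step(params, lr=1e-4, momentum=0.9, buffers=None):
    """One SignSGD update: step along the sign of the (optionally
    momentum-averaged) gradient. Unlike Adam/Adafactor, no second-moment
    statistics are kept; the per-coordinate step size is just +/- lr."""
    with torch.no_grad():
        for i, p in enumerate(params):
            if p.grad is None:
                continue
            g = p.grad
            if buffers is not None:  # optional momentum (the "Signum" variant)
                buffers[i].mul_(momentum).add_(g, alpha=1 - momentum)
                g = buffers[i]
            p.add_(g.sign(), alpha=-lr)
```
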
Rosie Zhao (@rosieyzh)

In our new work on evaluating optimizers for LLM training, we perform a series of experiments to investigate the role of adaptivity in optimizers like Adam in achieving good performance and stability. A thread: 🧵
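
For readers who want to see exactly where the "adaptivity" enters, here is a textbook sketch of one Adam step (not the paper's code); the per-coordinate second-moment estimate v is the adaptive part:

```python
import torch

def adam_step(param, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. Expects state = {"t": 0, "m": zeros_like(param),
    "v": zeros_like(param)} initialized once per parameter."""
    g = param.grad
    state["t"] += 1
    state["m"].mul_(beta1).add_(g, alpha=1 - beta1)             # momentum: EMA of g
    state["v"].mul_(beta2).addcmul_(g, g, value=1 - beta2)      # adaptivity: EMA of g^2
    m_hat = state["m"] / (1 - beta1 ** state["t"])              # bias corrections
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    with torch.no_grad():
        param.sub_(lr * m_hat / (v_hat.sqrt() + eps))
```
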
David Brandfonbrener (@brandfonbrener)

How does test loss change as we change the training data? And how does this interact with scaling laws?

We propose a methodology to approach these questions by showing that we can predict the performance across datasets and losses with simple shifted power law fits.
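
As a concrete illustration of a shifted power-law fit (my own sketch; the functional form is the generic L(n) = a·n^(−b) + e, and the data points below are made up, not results from the paper):

```python
import numpy as np
from scipy.optimize import curve_fit

def shifted_power_law(n, a, b, e):
    # Loss vs. scale n with an additive "irreducible" offset e.
    return a * np.power(n, -b) + e

# Hypothetical (scale, validation loss) pairs for a single training dataset.
scale = np.array([1e7, 3e7, 1e8, 3e8, 1e9])
loss  = np.array([4.10, 3.72, 3.41, 3.18, 3.01])

(a, b, e), _ = curve_fit(shifted_power_law, scale, loss, p0=[10.0, 0.1, 2.0], maxfev=20000)
print(f"L(n) ≈ {a:.1f} * n^(-{b:.3f}) + {e:.2f}")
```
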
Sham Kakade (@shamkakade6)

(1/n) 💡How can we speed up the serial runtime of long pre-training runs? Enter Critical Batch Size (CBS): the tipping point where the gains of data parallelism balance with diminishing efficiency. Doubling batch size halves the optimization steps—until we hit CBS, beyond which
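
One common way to make the "doubling halves the steps until CBS" intuition concrete is the steps-versus-batch-size relation below (a sketch with a generic parameterization and made-up numbers, not necessarily the exact formulation used in the paper):

```python
def steps_to_target(batch_size, s_min, b_crit):
    """Approximate optimizer steps needed to reach a fixed loss target.
    For batch_size << b_crit, doubling the batch roughly halves the steps;
    for batch_size >> b_crit, steps flatten near s_min and extra data
    parallelism stops buying serial speedup."""
    return s_min * (1.0 + b_crit / batch_size)

# Hypothetical numbers: s_min = 10k steps, critical batch size = 1M tokens.
for k in range(6):
    b = 62_500 * 2**k
    print(f"batch {b:>9,d} tokens -> ~{steps_to_target(b, 10_000, 1_000_000):,.0f} steps")
```
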
Sham Kakade (@shamkakade6)

1/n In new work, we draw connections between accelerated SGD and various recent optimizers including AdEMAMix, Schedule-Free optimizers and MARS, and use it to design ‘Simplified-AdEMAMix’ which matches performance of AdEMAMix without any extra momentum buffer.
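
For context on what the "extra momentum buffer" refers to, here is a rough sketch of an AdEMAMix-style update as I read the original AdEMAMix paper (simplified: no bias correction or weight decay, and not the Simplified-AdEMAMix variant introduced in this thread):

```python
import torch

def ademamix_like_step(param, state, lr=1e-3, beta1=0.9, beta2=0.999,
                       beta3=0.9999, alpha=5.0, eps=1e-8):
    """Mixes a fast EMA (m_fast) and a slow EMA (m_slow) of the gradient,
    scaled by an Adam-style second moment. The slow EMA m_slow is the extra
    momentum buffer that the simplified variant reportedly avoids."""
    g = param.grad
    state["m_fast"].mul_(beta1).add_(g, alpha=1 - beta1)
    state["m_slow"].mul_(beta3).add_(g, alpha=1 - beta3)
    state["v"].mul_(beta2).addcmul_(g, g, value=1 - beta2)
    with torch.no_grad():
        update = (state["m_fast"] + alpha * state["m_slow"]) / (state["v"].sqrt() + eps)
        param.sub_(lr * update)
```
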
Ashok Cutkosky (@ashokcutkosky)

Some ideas on a new optimizer from my student Qinzi Zhang: (github.com/ZQZCalin/train…) Early stages, but the empirical results are really promising! Would love to hear any thoughts, either on the empirical side or analysis-wise, and open to collaboration!

rohan anil (@_arohan_)

Today some of my former and new colleagues are hosting the AlgoPerf workshop (algoperf-workshop.github.io); I will drop by and participate in a panel tomorrow. David Kanter asked if he could use the fact that the area chair for ICLR called our work and timings on distributed Shampoo useless about

Depen Morwani (@depen_morwani)

Excited to attend #ICLR25 this week. My DMs are open, feel free to drop a message to talk about anything related to optimization of deep networks. Presenting multiple works related to second order optimization, critical batch size and diagonal preconditioning. Details below.