Nikhil Vyas (@vyasnikhil96)'s Twitter Profile
Nikhil Vyas

@vyasnikhil96

@OpenAI Prev: Postdoc at Harvard, PhD @MITEECS.

ID: 3316896696

Website: https://nikhilvyas.github.io/ · Joined: 16-08-2015 15:09:23

160 Tweets

752 Followers

609 Following

Sham Kakade (@shamkakade6)

Which optimizer is opt? Our new work compares SGD, Adam, Adafactor (+ momentum), Lion, and, simply, SignSGD on LLM training wrt performance _and_ hyperparameter stability. tldr: Use anything but SGD, the rest are nearly identical:
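
For context on the simplest optimizer in that comparison, here is a minimal sketch of a SignSGD/Signum-style step (my own PyTorch illustration, not code from the paper; the exact variant and hyperparameters used in the study may differ):

```python
import torch

def signsgd_step(params, lr=1e-4, momentum=0.9, buffers=None):
    """One SignSGD update: step along the sign of the (optionally
    momentum-averaged) gradient. Unlike Adam/Adafactor, no second-moment
    statistics are kept; the per-coordinate step size is just +/- lr."""
    with torch.no_grad():
        for i, p in enumerate(params):
            if p.grad is None:
                continue
            g = p.grad
            if buffers is not None:  # optional momentum (the "Signum" variant)
                buffers[i].mul_(momentum).add_(g, alpha=1 - momentum)
                g = buffers[i]
            p.add_(g.sign(), alpha=-lr)
```
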
Rosie Zhao (@rosieyzh)

In our new work on evaluating optimizers for LLM training, we perform a series of experiments to investigate the role of adaptivity in optimizers like Adam in achieving good performance and stability. A thread: 🧵
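
For readers who want to see exactly where the "adaptivity" enters, here is a textbook sketch of one Adam step (not the paper's code); the per-coordinate second-moment estimate v is the adaptive part:

```python
import torch

def adam_step(param, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. Expects state = {"t": 0, "m": zeros_like(param),
    "v": zeros_like(param)} initialized once per parameter."""
    g = param.grad
    state["t"] += 1
    state["m"].mul_(beta1).add_(g, alpha=1 - beta1)             # momentum: EMA of g
    state["v"].mul_(beta2).addcmul_(g, g, value=1 - beta2)      # adaptivity: EMA of g^2
    m_hat = state["m"] / (1 - beta1 ** state["t"])              # bias corrections
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    with torch.no_grad():
        param.sub_(lr * m_hat / (v_hat.sqrt() + eps))
```
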
David Brandfonbrener (@brandfonbrener)

How does test loss change as we change the training data? And how does this interact with scaling laws?

We propose a methodology to approach these questions by showing that we can predict the performance across datasets and losses with simple shifted power law fits.
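
As a concrete illustration of a shifted power-law fit (my own sketch; the functional form is the generic L(n) = a·n^(−b) + e, and the data points below are made up, not results from the paper):

```python
import numpy as np
from scipy.optimize import curve_fit

def shifted_power_law(n, a, b, e):
    # Loss vs. scale n with an additive "irreducible" offset e.
    return a * np.power(n, -b) + e

# Hypothetical (scale, validation loss) pairs for a single training dataset.
scale = np.array([1e7, 3e7, 1e8, 3e8, 1e9])
loss  = np.array([4.10, 3.72, 3.41, 3.18, 3.01])

(a, b, e), _ = curve_fit(shifted_power_law, scale, loss, p0=[10.0, 0.1, 2.0], maxfev=20000)
print(f"L(n) ≈ {a:.1f} * n^(-{b:.3f}) + {e:.2f}")
```
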
Sham Kakade (@shamkakade6)

(1/n) 💡How can we speed up the serial runtime of long pre-training runs? Enter Critical Batch Size (CBS): the tipping point where the gains of data parallelism balance with diminishing efficiency. Doubling batch size halves the optimization steps—until we hit CBS, beyond which
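
One common way to make the "doubling halves the steps until CBS" intuition concrete is the steps-versus-batch-size relation below (a sketch with a generic parameterization and made-up numbers, not necessarily the exact formulation used in the paper):

```python
def steps_to_target(batch_size, s_min, b_crit):
    """Approximate optimizer steps needed to reach a fixed loss target.
    For batch_size << b_crit, doubling the batch roughly halves the steps;
    for batch_size >> b_crit, steps flatten near s_min and extra data
    parallelism stops buying serial speedup."""
    return s_min * (1.0 + b_crit / batch_size)

# Hypothetical numbers: s_min = 10k steps, critical batch size = 1M tokens.
for k in range(6):
    b = 62_500 * 2**k
    print(f"batch {b:>9,d} tokens -> ~{steps_to_target(b, 10_000, 1_000_000):,.0f} steps")
```
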
Sham Kakade (@shamkakade6)

1/n In new work, we draw connections between accelerated SGD and various recent optimizers including AdEMAMix, Schedule-Free optimizers and MARS, and use it to design ‘Simplified-AdEMAMix’ which matches performance of AdEMAMix without any extra momentum buffer.
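
For context on what the "extra momentum buffer" refers to, here is a rough sketch of an AdEMAMix-style update as I read the original AdEMAMix paper (simplified: no bias correction or weight decay, and not the Simplified-AdEMAMix variant introduced in this thread):

```python
import torch

def ademamix_like_step(param, state, lr=1e-3, beta1=0.9, beta2=0.999,
                       beta3=0.9999, alpha=5.0, eps=1e-8):
    """Mixes a fast EMA (m_fast) and a slow EMA (m_slow) of the gradient,
    scaled by an Adam-style second moment. The slow EMA m_slow is the extra
    momentum buffer that the simplified variant reportedly avoids."""
    g = param.grad
    state["m_fast"].mul_(beta1).add_(g, alpha=1 - beta1)
    state["m_slow"].mul_(beta3).add_(g, alpha=1 - beta3)
    state["v"].mul_(beta2).addcmul_(g, g, value=1 - beta2)
    with torch.no_grad():
        update = (state["m_fast"] + alpha * state["m_slow"]) / (state["v"].sqrt() + eps)
        param.sub_(lr * update)
```
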
Ashok Cutkosky (@ashokcutkosky)

Some ideas on a new optimizer from my student Qinzi Zhang: (github.com/ZQZCalin/train…) Early stages, but the empirical results are really promising! Would love to hear any thoughts, either on the empirical side or analysis-wise, and open to collaboration!

rohan anil (@_arohan_)

Today some of my former and new colleagues are hosting the AlgoPerf workshop (algoperf-workshop.github.io); I will drop by and participate in a panel tomorrow. David Kanter asked if he could use the fact that the area chair for ICLR called our work and timings on distributed Shampoo useless about

Depen Morwani (@depen_morwani)

Excited to attend #ICLR25 this week. My DMs are open, feel free to drop a message to talk about anything related to optimization of deep networks. Presenting multiple works related to second order optimization, critical batch size and diagonal preconditioning. Details below.