Depen Morwani (@depen_morwani)'s Twitter Profile
Depen Morwani

@depen_morwani

PhD student at Harvard ML Foundations, Research Associate at Google AI, completed MS from IIT Madras

ID: 2560976503

Joined: 11-06-2014 09:10:39

79 Tweets

211 Followers

136 Following

Sham Kakade (@shamkakade6)'s Twitter Profile Photo

Why does Shampoo work well? Our new work sheds light on this, highlighting a widespread misconception about the optimizer. We show the *square* of Shampoo's preconditioner is provably close to the optimal Kronecker approximation of the (Adagrad) Hessian. See: arxiv.org/pdf/2406.17748
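A toy numerical sketch of the objects in this claim (my own illustration, not the paper's construction): for a small matrix parameter, build the full Adagrad matrix A = Σ_t vec(G_t) vec(G_t)^T, form Shampoo's Kronecker factors L = Σ_t G_t G_t^T and R = Σ_t G_t^T G_t, and compare A against the best scalar multiple of R^{1/2} ⊗ L^{1/2}, i.e. the square of Shampoo's preconditioner under the column-major vec convention. The printed error is only a sanity check on random gradients, not a verification of the theorem.

```python
# Toy numerical illustration (not the paper's proof) of the quantities in the tweet.
import numpy as np

rng = np.random.default_rng(0)
m, n, T = 4, 3, 200          # parameter shape and number of gradient steps

def psd_sqrt(M):
    """Symmetric PSD square root via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

A = np.zeros((m * n, m * n))  # full Adagrad matrix over the flattened parameter
L = np.zeros((m, m))          # Shampoo's left factor statistic
R = np.zeros((n, n))          # Shampoo's right factor statistic
for _ in range(T):
    G = rng.standard_normal((m, n))
    g = G.flatten(order="F")  # column-major vec, so vec(L G R) = (R ⊗ L) vec(G)
    A += np.outer(g, g)
    L += G @ G.T
    R += G.T @ G

K = np.kron(psd_sqrt(R), psd_sqrt(L))      # the "square" of Shampoo's preconditioner
c = np.trace(K.T @ A) / np.trace(K.T @ K)  # best scalar fit in Frobenius norm
rel_err = np.linalg.norm(A - c * K) / np.linalg.norm(A)
print(f"relative Frobenius error of scaled Kronecker fit: {rel_err:.3f}")
```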

Nikhil Vyas (@vyasnikhil96)'s Twitter Profile Photo

1/n A technical thread on our results in arxiv.org/pdf/2406.17748 connecting the Shampoo optimizer to the optimal Kronecker-product approximation of the Adagrad (or Hessian) preconditioner.

Kempner Institute at Harvard University (@kempnerinst)'s Twitter Profile Photo

We're thrilled to introduce the 2024 cohort of #KempnerInstitute Graduate Fellows! This year’s recipients include seven incoming and eight continuing graduate students enrolled across six Harvard University Ph.D. programs. Read more: bit.ly/3L7cPW9 #AI #NeuroAI #ML

Kempner Institute at Harvard University (@kempnerinst)'s Twitter Profile Photo

NEW #KempnerInstitute blog: Rosie Zhao, Depen Morwani, David Brandfonbrener, Nikhil Vyas & Sham Kakade study a variety of #LLM training optimizers and find they are all fairly similar except for SGD, which is notably worse. Read more: bit.ly/3S5PmZk #ML #AI

Sham Kakade (@shamkakade6)'s Twitter Profile Photo

Which optimizer is opt? Our new work compares SGD, Adam, Adafactor (+ momentum), Lion, and, simply, SignSGD on LLM training wrt performance _and_ hyperparameter stability. tldr: Use anything but SGD, the rest are nearly identical:
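For readers unfamiliar with the optimizers being compared, here is a minimal sketch of three of the update rules (plain SGD with momentum, SignSGD with momentum, and Adam) as pure-numpy functions; hyperparameter names and defaults are illustrative, not the paper's settings.

```python
# Minimal per-parameter update rules for three of the optimizers in the comparison.
import numpy as np

def sgd_step(w, g, state, lr=0.1, beta=0.9):
    """SGD with heavy-ball momentum."""
    state["m"] = beta * state.get("m", np.zeros_like(w)) + g
    return w - lr * state["m"]

def signsgd_step(w, g, state, lr=3e-4, beta=0.9):
    """SignSGD with momentum: only the sign of the smoothed gradient is used."""
    state["m"] = beta * state.get("m", np.zeros_like(w)) + (1 - beta) * g
    return w - lr * np.sign(state["m"])

def adam_step(w, g, state, lr=3e-4, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: per-coordinate step sizes from a second-moment estimate."""
    t = state["t"] = state.get("t", 0) + 1
    state["m"] = b1 * state.get("m", np.zeros_like(w)) + (1 - b1) * g
    state["v"] = b2 * state.get("v", np.zeros_like(w)) + (1 - b2) * g**2
    m_hat = state["m"] / (1 - b1**t)
    v_hat = state["v"] / (1 - b2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)
```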

Rosie Zhao (@rosieyzh)'s Twitter Profile Photo

In our new work on evaluating optimizers for LLM training, we perform a series of experiments to investigate the role of adaptivity in optimizers like Adam in achieving good performance and stability. A thread: 🧵

Sham Kakade (@shamkakade6)'s Twitter Profile Photo

1/n Introducing SOAP (ShampoO with Adam in the Preconditioner's eigenbasis): A deep learning optimization algorithm that applies Adam in Shampoo's eigenbasis. SOAP outperforms both AdamW and Shampoo in language model pretraining.
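A rough sketch of the idea as described in the tweet, paraphrased rather than taken from the authors' implementation: keep Shampoo-style factor statistics, periodically refresh their eigenbases, run Adam's moment updates on the gradient rotated into that basis, and rotate the resulting step back. Details such as the refresh schedule and rotating the Adam moments when the basis changes are handled properly in the real method and only crudely here.

```python
# A paraphrased sketch of a SOAP-like step, not the authors' implementation.
import numpy as np

class SoapLikeSketch:
    def __init__(self, shape, lr=3e-4, b1=0.9, b2=0.999, shampoo_beta=0.95, eps=1e-8):
        m, n = shape
        self.lr, self.b1, self.b2, self.sb, self.eps = lr, b1, b2, shampoo_beta, eps
        self.L, self.R = np.zeros((m, m)), np.zeros((n, n))   # Shampoo factor statistics
        self.M, self.V = np.zeros(shape), np.zeros(shape)     # Adam moments (rotated basis)
        self.QL, self.QR = np.eye(m), np.eye(n)               # current eigenbases
        self.t = 0

    def step(self, W, G, refresh_every=10):
        self.t += 1
        self.L = self.sb * self.L + (1 - self.sb) * (G @ G.T)
        self.R = self.sb * self.R + (1 - self.sb) * (G.T @ G)
        if self.t % refresh_every == 1:            # periodically refresh the eigenbases
            self.QL = np.linalg.eigh(self.L)[1]
            self.QR = np.linalg.eigh(self.R)[1]
        Gr = self.QL.T @ G @ self.QR               # rotate gradient into the eigenbasis
        self.M = self.b1 * self.M + (1 - self.b1) * Gr
        self.V = self.b2 * self.V + (1 - self.b2) * Gr**2
        m_hat = self.M / (1 - self.b1**self.t)
        v_hat = self.V / (1 - self.b2**self.t)
        step_rot = m_hat / (np.sqrt(v_hat) + self.eps)
        return W - self.lr * (self.QL @ step_rot @ self.QR.T)  # rotate the step back
```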

Sham Kakade (@shamkakade6)'s Twitter Profile Photo

1/5 ⚡ Introducing Flash Inference: an *exact* method cutting inference time for Long Convolution Sequence Models (LCSMs) to near-linear O(L log² L) complexity! Faster inference, same precision. Learn how we accelerate LCSM inference.
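As background only (this is not the Flash Inference algorithm): the core operation in a long convolution sequence model is a causal convolution of the token stream with a learned length-L filter, and generating tokens one at a time the naive way costs O(t) work at step t, i.e. O(L²) overall. That quadratic baseline is what the announced O(L log² L) method improves on.

```python
# Background sketch of why naive autoregressive LCSM decoding is quadratic.
# The filter h, feature stream x, and feedback nonlinearity are all toy choices.
import numpy as np

rng = np.random.default_rng(0)
L = 1024
h = rng.standard_normal(L)   # learned convolution filter (single channel, toy)
x = np.zeros(L)
x[0] = 1.0                   # seed token feature

for t in range(1, L):
    # strictly causal conv output at position t: sum_{s<t} h[t-1-s] * x[s]
    y_t = h[:t][::-1] @ x[:t]    # O(t) work at step t -> O(L^2) total
    x[t] = np.tanh(y_t)          # toy autoregressive feedback
```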

Sham Kakade (@shamkakade6)'s Twitter Profile Photo

(1/n) 💡How can we speed up the serial runtime of long pre-training runs? Enter Critical Batch Size (CBS): the tipping point where the gains of data parallelism balance with diminishing efficiency. Doubling batch size halves the optimization steps—until we hit CBS, beyond which
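One standard empirical model of this tradeoff from earlier large-batch-training work (illustrative only, not necessarily the model used in this paper) takes steps-to-target as S(B) = S_min · (1 + B_crit / B): below B_crit, doubling the batch size roughly halves the number of serial steps, while far above it extra data parallelism buys almost nothing.

```python
# Illustrative steps-vs-batch-size curve; S_min and B_crit are made-up numbers.
S_min, B_crit = 10_000, 512

def steps_to_target(B):
    return S_min * (1 + B_crit / B)

for B in [64, 128, 256, 512, 1024, 2048, 4096]:
    print(f"B={B:5d}  steps≈{steps_to_target(B):9.0f}")
# B=64 -> ~90k steps, B=128 -> ~50k (about half), ...,
# B=2048 -> ~12.5k, B=4096 -> ~11.2k (diminishing returns past B_crit).
```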

Depen Morwani (@depen_morwani)'s Twitter Profile Photo

Very excited to speak at the Theory of Interpretable AI seminar this coming Thursday at 10am EST. Please join to hear about recent work on how margin maximization can explain some intriguing observations in mechanistic interpretability.

Gustaf Ahdritz (@gahdritz)'s Twitter Profile Photo

New preprint (w/ Apple collaborators Aravind Gollakota, Parikshit Gopalan, Charlotte Peale, and Udi Wieder)! We define "higher-order calibration" and prove that it's a necessary and sufficient condition for accurate uncertainty decomposition. (1/N) arxiv.org/abs/2412.18808

Depen Morwani (@depen_morwani)'s Twitter Profile Photo

Excited to share this new work, where we draw connections between theoretical accelerated SGD variants and practical optimizers such as Schedule-Free and AdEMAMix. This connection also helps us match AdEMAMix's performance without the extra buffer. Detailed thread below.
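For context on the "extra buffer" mentioned above, here is my reading of an AdEMAMix-style update (constants illustrative, not the official implementation): on top of Adam's fast gradient EMA and second-moment estimate, it keeps a second, very slow gradient EMA and mixes it into the numerator of the step; that slow EMA is the additional optimizer state the new work aims to do without.

```python
# A rough AdEMAMix-style step (my reading, illustrative constants).
import numpy as np

def ademamix_like_step(w, g, state, lr=1e-3, b1=0.9, b2=0.999, b3=0.9999,
                       alpha=5.0, eps=1e-8):
    t = state["t"] = state.get("t", 0) + 1
    z = np.zeros_like(w)
    state["m1"] = b1 * state.get("m1", z) + (1 - b1) * g   # fast EMA (as in Adam)
    state["m2"] = b3 * state.get("m2", z) + (1 - b3) * g   # slow EMA: the "extra buffer"
    state["v"]  = b2 * state.get("v",  z) + (1 - b2) * g**2
    m1_hat = state["m1"] / (1 - b1**t)
    v_hat  = state["v"]  / (1 - b2**t)
    return w - lr * (m1_hat + alpha * state["m2"]) / (np.sqrt(v_hat) + eps)
```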