Depen Morwani (@depen_morwani)'s Twitter Profile
Depen Morwani

@depen_morwani

PhD student at Harvard ML Foundations, Research Associate at Google AI, completed MS from IIT Madras

ID: 2560976503

Joined: 11-06-2014 09:10:39

79 Tweets

211 Followers

136 Following

Sham Kakade (@shamkakade6)'s Twitter Profile Photo

Why does Shampoo work well? Our new work sheds light on this, highlighting a widespread misconception about the optimizer. We show the *square* of Shampoo's preconditioner is provably close to the optimal Kronecker approximation of the (Adagrad) Hessian. See: arxiv.org/pdf/2406.17748
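A toy numerical sketch of the objects in this claim (my own illustration, not the paper's construction): for a small matrix parameter, build the full Adagrad matrix A = Σ_t vec(G_t) vec(G_t)^T, form Shampoo's Kronecker factors L = Σ_t G_t G_t^T and R = Σ_t G_t^T G_t, and compare A against the best scalar multiple of R^{1/2} ⊗ L^{1/2}, i.e. the square of Shampoo's preconditioner under the column-major vec convention. The printed error is only a sanity check on random gradients, not a verification of the theorem.

```python
# Toy numerical illustration (not the paper's proof) of the quantities in the tweet.
import numpy as np

rng = np.random.default_rng(0)
m, n, T = 4, 3, 200          # parameter shape and number of gradient steps

def psd_sqrt(M):
    """Symmetric PSD square root via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

A = np.zeros((m * n, m * n))  # full Adagrad matrix over the flattened parameter
L = np.zeros((m, m))          # Shampoo's left factor statistic
R = np.zeros((n, n))          # Shampoo's right factor statistic
for _ in range(T):
    G = rng.standard_normal((m, n))
    g = G.flatten(order="F")  # column-major vec, so vec(L G R) = (R ⊗ L) vec(G)
    A += np.outer(g, g)
    L += G @ G.T
    R += G.T @ G

K = np.kron(psd_sqrt(R), psd_sqrt(L))      # the "square" of Shampoo's preconditioner
c = np.trace(K.T @ A) / np.trace(K.T @ K)  # best scalar fit in Frobenius norm
rel_err = np.linalg.norm(A - c * K) / np.linalg.norm(A)
print(f"relative Frobenius error of scaled Kronecker fit: {rel_err:.3f}")
```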

Nikhil Vyas (@vyasnikhil96)'s Twitter Profile Photo

1/n A technical thread on our results in arxiv.org/pdf/2406.17748 connecting the Shampoo optimizer to the optimal Kronecker-product approximation of the Adagrad (or Hessian) preconditioner.

Kempner Institute at Harvard University (@kempnerinst)'s Twitter Profile Photo

We're thrilled to introduce the 2024 cohort of #KempnerInstitute Graduate Fellows! This year’s recipients include seven incoming and eight continuing graduate students enrolled across six Harvard University Ph.D. programs. Read more: bit.ly/3L7cPW9 #AI #NeuroAI #ML

Kempner Institute at Harvard University (@kempnerinst)'s Twitter Profile Photo

NEW #KempnerInstitute blog: Rosie Zhao, Depen Morwani, David Brandfonbrener, Nikhil Vyas & Sham Kakade study a variety of #LLM training optimizers and find they are all fairly similar except for SGD, which is notably worse. Read more: bit.ly/3S5PmZk #ML #AI

Sham Kakade (@shamkakade6)'s Twitter Profile Photo

Which optimizer is opt? Our new work compares SGD, Adam, Adafactor (+ momentum), Lion, and, simply, SignSGD on LLM training wrt performance _and_ hyperparameter stability. tldr: Use anything but SGD, the rest are nearly identical:
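For readers unfamiliar with the optimizers being compared, here is a minimal sketch of three of the update rules (plain SGD with momentum, SignSGD with momentum, and Adam) as pure-numpy functions; hyperparameter names and defaults are illustrative, not the paper's settings.

```python
# Minimal per-parameter update rules for three of the optimizers in the comparison.
import numpy as np

def sgd_step(w, g, state, lr=0.1, beta=0.9):
    """SGD with heavy-ball momentum."""
    state["m"] = beta * state.get("m", np.zeros_like(w)) + g
    return w - lr * state["m"]

def signsgd_step(w, g, state, lr=3e-4, beta=0.9):
    """SignSGD with momentum: only the sign of the smoothed gradient is used."""
    state["m"] = beta * state.get("m", np.zeros_like(w)) + (1 - beta) * g
    return w - lr * np.sign(state["m"])

def adam_step(w, g, state, lr=3e-4, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: per-coordinate step sizes from a second-moment estimate."""
    t = state["t"] = state.get("t", 0) + 1
    state["m"] = b1 * state.get("m", np.zeros_like(w)) + (1 - b1) * g
    state["v"] = b2 * state.get("v", np.zeros_like(w)) + (1 - b2) * g**2
    m_hat = state["m"] / (1 - b1**t)
    v_hat = state["v"] / (1 - b2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)
```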

Rosie Zhao (@rosieyzh)'s Twitter Profile Photo

In our new work on evaluating optimizers for LLM training, we perform a series of experiments to investigate the role of adaptivity in optimizers like Adam in achieving good performance and stability. A thread: 🧵

Sham Kakade (@shamkakade6)'s Twitter Profile Photo

1/n Introducing SOAP (ShampoO with Adam in the Preconditioner's eigenbasis): A deep learning optimization algorithm that applies Adam in Shampoo's eigenbasis. SOAP outperforms both AdamW and Shampoo in language model pretraining.
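A rough sketch of the idea as described in the tweet, paraphrased rather than taken from the authors' implementation: keep Shampoo-style factor statistics, periodically refresh their eigenbases, run Adam's moment updates on the gradient rotated into that basis, and rotate the resulting step back. Details such as the refresh schedule and rotating the Adam moments when the basis changes are handled properly in the real method and only crudely here.

```python
# A paraphrased sketch of a SOAP-like step, not the authors' implementation.
import numpy as np

class SoapLikeSketch:
    def __init__(self, shape, lr=3e-4, b1=0.9, b2=0.999, shampoo_beta=0.95, eps=1e-8):
        m, n = shape
        self.lr, self.b1, self.b2, self.sb, self.eps = lr, b1, b2, shampoo_beta, eps
        self.L, self.R = np.zeros((m, m)), np.zeros((n, n))   # Shampoo factor statistics
        self.M, self.V = np.zeros(shape), np.zeros(shape)     # Adam moments (rotated basis)
        self.QL, self.QR = np.eye(m), np.eye(n)               # current eigenbases
        self.t = 0

    def step(self, W, G, refresh_every=10):
        self.t += 1
        self.L = self.sb * self.L + (1 - self.sb) * (G @ G.T)
        self.R = self.sb * self.R + (1 - self.sb) * (G.T @ G)
        if self.t % refresh_every == 1:            # periodically refresh the eigenbases
            self.QL = np.linalg.eigh(self.L)[1]
            self.QR = np.linalg.eigh(self.R)[1]
        Gr = self.QL.T @ G @ self.QR               # rotate gradient into the eigenbasis
        self.M = self.b1 * self.M + (1 - self.b1) * Gr
        self.V = self.b2 * self.V + (1 - self.b2) * Gr**2
        m_hat = self.M / (1 - self.b1**self.t)
        v_hat = self.V / (1 - self.b2**self.t)
        step_rot = m_hat / (np.sqrt(v_hat) + self.eps)
        return W - self.lr * (self.QL @ step_rot @ self.QR.T)  # rotate the step back
```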

Sham Kakade (@shamkakade6)'s Twitter Profile Photo

1/5 ⚡ Introducing Flash Inference: an *exact* method cutting inference time for Long Convolution Sequence Models (LCSMs) to near-linear O(L log² L) complexity! Faster inference, same precision. Learn how we accelerate LCSM inference.
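As background only (this is not the Flash Inference algorithm): the core operation in a long convolution sequence model is a causal convolution of the token stream with a learned length-L filter, and generating tokens one at a time the naive way costs O(t) work at step t, i.e. O(L²) overall. That quadratic baseline is what the announced O(L log² L) method improves on.

```python
# Background sketch of why naive autoregressive LCSM decoding is quadratic.
# The filter h, feature stream x, and feedback nonlinearity are all toy choices.
import numpy as np

rng = np.random.default_rng(0)
L = 1024
h = rng.standard_normal(L)   # learned convolution filter (single channel, toy)
x = np.zeros(L)
x[0] = 1.0                   # seed token feature

for t in range(1, L):
    # strictly causal conv output at position t: sum_{s<t} h[t-1-s] * x[s]
    y_t = h[:t][::-1] @ x[:t]    # O(t) work at step t -> O(L^2) total
    x[t] = np.tanh(y_t)          # toy autoregressive feedback
```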

Sham Kakade (@shamkakade6)'s Twitter Profile Photo

(1/n) 💡How can we speed up the serial runtime of long pre-training runs? Enter Critical Batch Size (CBS): the tipping point where the gains of data parallelism balance with diminishing efficiency. Doubling batch size halves the optimization steps—until we hit CBS, beyond which
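One standard empirical model of this tradeoff from earlier large-batch-training work (illustrative only, not necessarily the model used in this paper) takes steps-to-target as S(B) = S_min · (1 + B_crit / B): below B_crit, doubling the batch size roughly halves the number of serial steps, while far above it extra data parallelism buys almost nothing.

```python
# Illustrative steps-vs-batch-size curve; S_min and B_crit are made-up numbers.
S_min, B_crit = 10_000, 512

def steps_to_target(B):
    return S_min * (1 + B_crit / B)

for B in [64, 128, 256, 512, 1024, 2048, 4096]:
    print(f"B={B:5d}  steps≈{steps_to_target(B):9.0f}")
# B=64 -> ~90k steps, B=128 -> ~50k (about half), ...,
# B=2048 -> ~12.5k, B=4096 -> ~11.2k (diminishing returns past B_crit).
```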

Depen Morwani (@depen_morwani)'s Twitter Profile Photo

Very excited to speak at the Theory of Interpretable AI seminar this coming Thursday at 10am EST. Please join to hear about recent work on how margin maximization can explain some intriguing observations in mechanistic interpretability.

Gustaf Ahdritz (@gahdritz)'s Twitter Profile Photo

New preprint (w/ Apple collaborators Aravind Gollakota, Parikshit Gopalan, Charlotte Peale, and Udi Wieder)! We define "higher-order calibration" and prove that it's a necessary and sufficient condition for accurate uncertainty decomposition. (1/N) arxiv.org/abs/2412.18808

Depen Morwani (@depen_morwani)'s Twitter Profile Photo

Excited to share this new work, where we draw connections between theoretical accelerated SGD variants and practical optimizers such as Schedule-Free and AdEMAMix. This connection also helps us match AdEMAMix's performance without the extra buffer. Detailed thread below.
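For context on the "extra buffer" mentioned above, here is my reading of an AdEMAMix-style update (constants illustrative, not the official implementation): on top of Adam's fast gradient EMA and second-moment estimate, it keeps a second, very slow gradient EMA and mixes it into the numerator of the step; that slow EMA is the additional optimizer state the new work aims to do without.

```python
# A rough AdEMAMix-style step (my reading, illustrative constants).
import numpy as np

def ademamix_like_step(w, g, state, lr=1e-3, b1=0.9, b2=0.999, b3=0.9999,
                       alpha=5.0, eps=1e-8):
    t = state["t"] = state.get("t", 0) + 1
    z = np.zeros_like(w)
    state["m1"] = b1 * state.get("m1", z) + (1 - b1) * g   # fast EMA (as in Adam)
    state["m2"] = b3 * state.get("m2", z) + (1 - b3) * g   # slow EMA: the "extra buffer"
    state["v"]  = b2 * state.get("v",  z) + (1 - b2) * g**2
    m1_hat = state["m1"] / (1 - b1**t)
    v_hat  = state["v"]  / (1 - b2**t)
    return w - lr * (m1_hat + alpha * state["m2"]) / (np.sqrt(v_hat) + eps)
```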