Runa Eschenhagen (@runame_) 's Twitter Profile
Runa Eschenhagen

@runame_

PhD student in machine learning @CambridgeMLG. Previously research scientist intern @AIatMeta (FAIR).

ID: 1453810064385617921

Link: https://runame.github.io/ · Joined: 28-10-2021 19:45:56

168 Tweets

513 Followers

45 Following

Zhiyuan Li (@zhiyuanli_) 's Twitter Profile Photo

Why does Adam outperform SGD in LLM training? Adaptive step sizes alone don't fully explain this, as Adam also surpasses adaptive SGD. Is coordinate-wise adaptivity the secret? Not entirely—Adam actually struggles in the rotated parameter space! 🧵 (1/6) arxiv.org/abs/2410.08198

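As a quick illustration of the rotation point above: Adam normalizes each coordinate separately, so re-expressing the same problem in a randomly rotated basis can change its behaviour, whereas plain (S)GD is rotation-equivariant. A minimal numpy sketch on a toy quadratic (my own construction, not the paper's LLM experiments):

```python
# Toy sketch (not the paper's setup): Adam is coordinate-wise, so conjugating the
# problem by a random rotation Q changes its trajectory and final loss; plain
# gradient descent would reach the same loss in either basis.
import numpy as np

rng = np.random.default_rng(0)
d = 50
H = np.diag(np.logspace(0, 3, d))                  # ill-conditioned quadratic: loss = 0.5 w^T H w
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthogonal rotation

def adam(grad_fn, w, steps=500, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    m = np.zeros_like(w); v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = b1 * m + (1 - b1) * g                  # first-moment EMA
        v = b2 * v + (1 - b2) * g**2               # second-moment EMA (coordinate-wise)
        mhat = m / (1 - b1**t); vhat = v / (1 - b2**t)
        w = w - lr * mhat / (np.sqrt(vhat) + eps)
    return w

w0 = rng.standard_normal(d)
w_orig = adam(lambda w: H @ w, w0.copy())          # original coordinates
u_rot = adam(lambda u: Q.T @ H @ Q @ u, Q.T @ w0)  # same problem, rotated coordinates
w_rot = Q @ u_rot                                  # map back to the original basis

loss = lambda w: 0.5 * w @ H @ w
print(f"loss, original basis: {loss(w_orig):.4f}")
print(f"loss, rotated basis:  {loss(w_rot):.4f}")  # typically different: Adam is not rotation-invariant
```
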
Bruno Mlodozeniec (@kayembruno) 's Twitter Profile Photo

Great to be back from #ICLR2025 in Singapore, and super excited to have given my first oral presentation on influence functions for diffusion models!

Dan Roy (@roydanroy) 's Twitter Profile Photo

This is a huge development. I want to highlight the theoreticians behind the scenes, because this paper represents the realization of the impact of years of careful theoretical research. It starts with Greg Yang opening up research on the muP scaling and

Katie Everett (@_katieeverett) 's Twitter Profile Photo

1. We often observe power laws between loss and compute: loss = a * flops ^ b + c
2. Models are rapidly becoming more efficient, i.e. use less compute to reach the same loss
But: which innovations actually change the exponent in the power law (b) vs change only the constant (a)?
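
To make the distinction concrete, here is a small numpy sketch with made-up coefficients (my own numbers, not from the thread): a constant-factor efficiency gain only rescales a, while a change in b alters the slope of the loss-compute curve on a log-log plot.

```python
# Toy illustration of the scaling-law form loss = a * flops**b + c.
# Coefficients below are made up for demonstration only.
import numpy as np

flops = np.logspace(18, 24, 7)           # hypothetical compute budgets
a, b, c = 1e3, -0.05, 1.8                # made-up baseline coefficients

baseline   = a * flops**b + c
cheaper    = (a * 0.5) * flops**b + c    # innovation that only changes the constant a
better_exp = a * flops**(b * 1.2) + c    # innovation that changes the exponent b

# the slope of log(loss - c) vs log(flops) recovers the exponent b
slope = np.polyfit(np.log(flops), np.log(baseline - c), 1)[0]
print(f"recovered exponent: {slope:.3f} (true b = {b})")
for name, curve in [("baseline", baseline), ("0.5x constant", cheaper), ("1.2x exponent", better_exp)]:
    print(f"{name:>14}: loss at 1e24 FLOPs = {curve[-1]:.3f}")
```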

Antonio Orvieto (@orvieto_antonio) 's Twitter Profile Photo

Adam is similar to many algorithms, but cannot be effectively replaced by any simpler variant in LMs. The community is starting to get the recipe right, but what is the secret sauce? Robert M. Gower 🇺🇦 and I found that it has to do with the beta parameters and variational inference.

Aaron Defazio (@aaron_defazio) 's Twitter Profile Photo

Why do gradients increase near the end of training? Read the paper to find out! We also propose a simple fix to AdamW that keeps gradient norms better behaved throughout training. arxiv.org/abs/2506.02285

Mark Schmidt (@markschmidtubc) 's Twitter Profile Photo

My former PhD student Fred Kunstner has been awarded the Canadian AI Assoc. / Assoc. canadienne pour l'IA Best Doctoral Dissertation Award: cs.ubc.ca/news/2025/06/f… His thesis on machine learning algorithms includes an EM proof "from the book", why Adam works, and the first provably-faster hyper-gradient method.

Agustinus Kristiadi (@akristiadi7) 's Twitter Profile Photo

📢 [Openings] I'm now an Assistant Prof in the CS dept at Western University. Funded PhD & MSc positions available! Topics: large probabilistic models, decision-making under uncertainty, and applications in AI4Science. More on agustinus.kristia.de/openings/

Thomas Pethick (@tmpethick) 's Twitter Profile Photo

When comparing optimization methods, we often change *multiple things at once*—geometry, normalization, etc.—possibly without realizing it. Let's disentangle these changes. 👇

Bruno Mlodozeniec (@kayembruno) 's Twitter Profile Photo

You don't need bespoke tools for causal inference. Probabilistic modelling is enough. I'll be making this case (and dodging pitchforks) at our ICML oral presentation tomorrow.

Thomas Zhang (@thomastckzhang) 's Twitter Profile Photo

I’ll be presenting our paper “On The Concurrence of Layer-wise Preconditioning Methods and Provable Feature Learning” at ICML during the Tuesday 11am poster session! DL opt is seeing a renaissance 🦾; what can we say from a NN feature learning perspective? 1/8

Jihao Andreas Lin (@jihaoandreaslin) 's Twitter Profile Photo

Excited to share our ICML 2025 paper: "Scalable Gaussian Processes with Latent Kronecker Structure" We unlock efficient linear algebra for your kernel matrix which *almost* has Kronecker product structure. Check out our paper here: arxiv.org/abs/2506.06895
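
The key computational trick with Kronecker-structured kernels is that matrix-vector products never need the full kernel matrix. A minimal numpy sketch of that identity (a generic illustration, not the paper's latent construction):

```python
# Toy sketch of why Kronecker structure helps: if the kernel factorizes as
# K = kron(A, B), then kron(A, B) @ vec(X) = vec(A @ X @ B.T) (row-major vec),
# so a matvec costs O(nm(n+m)) instead of O((nm)^2).
import numpy as np

rng = np.random.default_rng(0)
n, m = 40, 30
A = rng.standard_normal((n, n)); A = A @ A.T   # toy SPD factor kernels
B = rng.standard_normal((m, m)); B = B @ B.T
x = rng.standard_normal(n * m)

dense = np.kron(A, B) @ x                      # builds the full 1200 x 1200 matrix
X = x.reshape(n, m)
structured = (A @ X @ B.T).reshape(-1)         # never forms kron(A, B)
print(np.allclose(dense, structured))          # True
```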

Tycho van der Ouderaa (@tychovdo) 's Twitter Profile Photo

This past spring, I spent time with the EXO Labs team to work on a new DL optimizer and wiring up clusters of Macs for distributed TRAINING on Apple Silicon. If you’re at ICML, be sure to come by the ES-FoMo@ICML2025 workshop (posters 1-2:30pm) this Saturday. I’ll be there to share some

Frank Schneider (@frankstefansch1) 's Twitter Profile Photo

At #ICML2025 and don't know which workshop to join? Why not come and celebrate/rant about open source ML with us? We got amazing speakers (Tri Dao is just one example)! Come by West Meeting Room 211-214 👋
