Runa Eschenhagen (@runame_) 's Twitter Profile
Runa Eschenhagen

@runame_

PhD student in machine learning @CambridgeMLG. Previously research scientist intern @AIatMeta (FAIR).

ID: 1453810064385617921

Link: https://runame.github.io/ · Joined: 28-10-2021 19:45:56

168 Tweets

513 Followers

45 Following

Zhiyuan Li (@zhiyuanli_) 's Twitter Profile Photo

Why does Adam outperform SGD in LLM training? Adaptive step sizes alone don't fully explain this, as Adam also surpasses adaptive SGD. Is coordinate-wise adaptivity the secret? Not entirely—Adam actually struggles in the rotated parameter space! 🧵 (1/6) arxiv.org/abs/2410.08198

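As a quick illustration of the rotation point above: Adam normalizes each coordinate separately, so re-expressing the same problem in a randomly rotated basis can change its behaviour, whereas plain (S)GD is rotation-equivariant. A minimal numpy sketch on a toy quadratic (my own construction, not the paper's LLM experiments):

```python
# Toy sketch (not the paper's setup): Adam is coordinate-wise, so conjugating the
# problem by a random rotation Q changes its trajectory and final loss; plain
# gradient descent would reach the same loss in either basis.
import numpy as np

rng = np.random.default_rng(0)
d = 50
H = np.diag(np.logspace(0, 3, d))                  # ill-conditioned quadratic: loss = 0.5 w^T H w
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthogonal rotation

def adam(grad_fn, w, steps=500, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    m = np.zeros_like(w); v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = b1 * m + (1 - b1) * g                  # first-moment EMA
        v = b2 * v + (1 - b2) * g**2               # second-moment EMA (coordinate-wise)
        mhat = m / (1 - b1**t); vhat = v / (1 - b2**t)
        w = w - lr * mhat / (np.sqrt(vhat) + eps)
    return w

w0 = rng.standard_normal(d)
w_orig = adam(lambda w: H @ w, w0.copy())          # original coordinates
u_rot = adam(lambda u: Q.T @ H @ Q @ u, Q.T @ w0)  # same problem, rotated coordinates
w_rot = Q @ u_rot                                  # map back to the original basis

loss = lambda w: 0.5 * w @ H @ w
print(f"loss, original basis: {loss(w_orig):.4f}")
print(f"loss, rotated basis:  {loss(w_rot):.4f}")  # typically different: Adam is not rotation-invariant
```
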
Bruno Mlodozeniec (@kayembruno) 's Twitter Profile Photo

Great to be back from #ICLR2025 in Singapore, and super excited to have given my first oral presentation on influence functions for diffusion models!

Dan Roy (@roydanroy) 's Twitter Profile Photo

This is a huge development. I want to highlight the theoreticians behind the scenes, because this paper represents the realization of the impact of years of careful theoretical research. It starts with Greg Yang opening up research on the muP scaling and

Katie Everett (@_katieeverett) 's Twitter Profile Photo

1. We often observe power laws between loss and compute: loss = a * flops ^ b + c
2. Models are rapidly becoming more efficient, i.e. use less compute to reach the same loss
But: which innovations actually change the exponent in the power law (b) vs change only the constant (a)?
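
To make the distinction concrete, here is a small numpy sketch with made-up coefficients (my own numbers, not from the thread): a constant-factor efficiency gain only rescales a, while a change in b alters the slope of the loss-compute curve on a log-log plot.

```python
# Toy illustration of the scaling-law form loss = a * flops**b + c.
# Coefficients below are made up for demonstration only.
import numpy as np

flops = np.logspace(18, 24, 7)           # hypothetical compute budgets
a, b, c = 1e3, -0.05, 1.8                # made-up baseline coefficients

baseline   = a * flops**b + c
cheaper    = (a * 0.5) * flops**b + c    # innovation that only changes the constant a
better_exp = a * flops**(b * 1.2) + c    # innovation that changes the exponent b

# the slope of log(loss - c) vs log(flops) recovers the exponent b
slope = np.polyfit(np.log(flops), np.log(baseline - c), 1)[0]
print(f"recovered exponent: {slope:.3f} (true b = {b})")
for name, curve in [("baseline", baseline), ("0.5x constant", cheaper), ("1.2x exponent", better_exp)]:
    print(f"{name:>14}: loss at 1e24 FLOPs = {curve[-1]:.3f}")
```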

Antonio Orvieto (@orvieto_antonio) 's Twitter Profile Photo

Adam is similar to many algorithms, but cannot be effectively replaced by any simpler variant in LMs. The community is starting to get the recipe right, but what is the secret sauce? Robert M. Gower 🇺🇦 and I found that it has to do with the beta parameters and variational inference.

Aaron Defazio (@aaron_defazio) 's Twitter Profile Photo

Why do gradients increase near the end of training? Read the paper to find out! We also propose a simple fix to AdamW that keeps gradient norms better behaved throughout training. arxiv.org/abs/2506.02285

Mark Schmidt (@markschmidtubc) 's Twitter Profile Photo

My former PhD student Fred Kunstner has been awarded the Canadian AI Assoc. / Assoc. canadienne pour l'IA Best Doctoral Dissertation Award: cs.ubc.ca/news/2025/06/f… His thesis on machine learning algorithms includes an EM proof "from the book", why Adam works, and the first provably-faster hyper-gradient method.

Agustinus Kristiadi (@akristiadi7) 's Twitter Profile Photo

📢 [Openings] I'm now an Assistant Prof in the CS dept at Western University. Funded PhD & MSc positions available! Topics: large probabilistic models, decision-making under uncertainty, and applications in AI4Science. More on agustinus.kristia.de/openings/

Thomas Pethick (@tmpethick) 's Twitter Profile Photo

When comparing optimization methods, we often change *multiple things at once*—geometry, normalization, etc.—possibly without realizing it. Let's disentangle these changes. 👇

Bruno Mlodozeniec (@kayembruno) 's Twitter Profile Photo

You don't need bespoke tools for causal inference. Probabilistic modelling is enough. I'll be making this case (and dodging pitchforks) at our ICML oral presentation tomorrow.

Thomas Zhang (@thomastckzhang) 's Twitter Profile Photo

I’ll be presenting our paper “On The Concurrence of Layer-wise Preconditioning Methods and Provable Feature Learning” at ICML during the Tuesday 11am poster session! DL opt is seeing a renaissance 🦾; what can we say from a NN feature learning perspective? 1/8

Jihao Andreas Lin (@jihaoandreaslin) 's Twitter Profile Photo

Excited to share our ICML 2025 paper: "Scalable Gaussian Processes with Latent Kronecker Structure" We unlock efficient linear algebra for your kernel matrix which *almost* has Kronecker product structure. Check out our paper here: arxiv.org/abs/2506.06895
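
The key computational trick with Kronecker-structured kernels is that matrix-vector products never need the full kernel matrix. A minimal numpy sketch of that identity (a generic illustration, not the paper's latent construction):

```python
# Toy sketch of why Kronecker structure helps: if the kernel factorizes as
# K = kron(A, B), then kron(A, B) @ vec(X) = vec(A @ X @ B.T) (row-major vec),
# so a matvec costs O(nm(n+m)) instead of O((nm)^2).
import numpy as np

rng = np.random.default_rng(0)
n, m = 40, 30
A = rng.standard_normal((n, n)); A = A @ A.T   # toy SPD factor kernels
B = rng.standard_normal((m, m)); B = B @ B.T
x = rng.standard_normal(n * m)

dense = np.kron(A, B) @ x                      # builds the full 1200 x 1200 matrix
X = x.reshape(n, m)
structured = (A @ X @ B.T).reshape(-1)         # never forms kron(A, B)
print(np.allclose(dense, structured))          # True
```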

Tycho van der Ouderaa (@tychovdo) 's Twitter Profile Photo

This past spring, I spent time with the EXO Labs team to work on a new DL optimizer and wiring up clusters of Macs for distributed TRAINING on Apple Silicon. If you’re at ICML, be sure to come by the ES-FoMo@ICML2025 workshop (posters 1-2:30pm) this Saturday. I’ll be there to share some

Frank Schneider (@frankstefansch1) 's Twitter Profile Photo

At #ICML2025 and don't know which workshop to join? Why not come and celebrate/rant about open source ML with us? We got amazing speakers (Tri Dao is just one example)! Come by West Meeting Room 211-214 👋
