Alex Hägele (@haeggee) 's Twitter Profile
Alex Hägele

@haeggee

Fellow @AnthropicAI + PhD Student in Machine Learning @ICepfl MLO. MSc/BSc from @ETH_en. Previously: Student Researcher @Apple MLR

ID: 1212418353094000640

Link: https://haeggee.github.io/ | Joined: 01-01-2020 17:00:47

271 Tweets

512 Followers

634 Following

Elliott Ash (@ellliottt) 's Twitter Profile Photo

Try out #Apertus, the fully open LLM (8B and 70B) developed by the Swiss AI Initiative (ETH, EPFL, CSCS).

Apertus isn’t just open-weights—it’s fully reproducible -- "Apertus" is Latin for "Open"! 

✅ Full code, data, checkpoints
✅ Respects robots.txt (retroactively) 
✅ 🐠
Paul Teiletche (@pteiletche) 's Twitter Profile Photo

Important release for the open-source community, check out the tech report! Glad to have been part of this great initiative🇨🇭

Matteo Pagliardini (@matpagliardini) 's Twitter Profile Photo

Replying to Lucas Beyer (bl16), Alex Hägele, EPFL, Andrei Semenov: I got excited about the results we got in the AdEMAMix paper (I'm the first author on that paper). Around that time, Andrei wanted to benchmark some of the recent optimizers. I naturally suggested AdEMAMix, as I was curious how it would compare to others.

xlr8harder (@xlr8harder) 's Twitter Profile Photo

Apertus is a bigger deal than a lot of people realize. It has fully open data, and I believe this is the largest open training run to date.

Skander Moalla (@skandermoalla) 's Twitter Profile Photo

A big step for Switzerland 🇨🇭 and a great achievement for our in-house alignment algorithm QRPO (x.com/skandermoalla/…) which has shown remarkable stability and predictability at the 70B scale 🚀!

Konstantin Mishchenko (@konstmish) 's Twitter Profile Photo

Two new papers from this week benchmark recent optimizers: Muon, Soap, Mars, ScheduleFree, Prodigy, AdEMAMix, Sophia, etc. The optimizers are also ablated across different settings (batch size, model scale, weight decay, scheduler).

A must read for anyone working on practical optimization.
Kaiyue Wen (@wen_kaiyue) 's Twitter Profile Photo

(1/n) Check out our new paper: "Fantastic Pretraining Optimizers and Where to Find Them"! >4000 models to find the fastest optimizer! 2× speedups over AdamW? Unlikely. Beware under-tuned baseline or limited scale! E.g. Muon: ~40% speedups <0.5B & only 10% at 1.2B (8× Chinchilla)!

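To put "8× Chinchilla" at 1.2B parameters into tokens, a quick back-of-the-envelope calculation (my own, using the common ~20 tokens-per-parameter Chinchilla heuristic, not a figure from the paper):

```python
# Back-of-the-envelope: what "1.2B parameters at 8x Chinchilla" means in tokens,
# assuming the usual ~20 training tokens per parameter for Chinchilla-optimal.
params = 1.2e9
tokens_per_param = 20          # common Chinchilla heuristic (assumption)
chinchilla_multiple = 8

tokens = params * tokens_per_param * chinchilla_multiple
print(f"~{tokens / 1e9:.0f}B training tokens")   # ~192B tokens
```
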
Ethan Perez (@ethanjperez) 's Twitter Profile Photo

We’re hiring someone to run the Anthropic Fellows Program! Our research collaborations have led to some of our best safety research and hires. We’re looking for an exceptional ops generalist, TPM, or research/eng manager to help us significantly scale and improve our collabs 🧵

Negar Foroutan ✈️ ICLR Singapore (@negarforoutan) 's Twitter Profile Photo

I’m glad to finally share the release of #Apertus, an open large language model developed together with colleagues at EPFL, ETH Zürich, and CSCS 🚀

Andrei Semenov (@andreisemenov17) 's Twitter Profile Photo

Optimizers. Finding 1.
We observed that a very large weight decay can be beneficial when training on a small number of tokens. Typically, runs with a large WD of 0.5 outperform those with the standard WD of 0.1 when training for ~2.5-3x the Chinchilla-optimal token count (this may change with model size).
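
Read concretely, the finding amounts to raising AdamW's weight_decay from 0.1 to 0.5 when the token budget is only a few Chinchilla multiples. A minimal sketch of my own (not the authors' benchmark code; model, learning rate, and sizes are placeholders):

```python
import torch

# Hypothetical setup illustrating the finding above; only weight_decay differs.
model = torch.nn.Linear(1024, 1024)          # stand-in for the real network

def chinchilla_tokens(n_params: float, multiple: float) -> float:
    return 20.0 * n_params * multiple        # ~20 tokens/param heuristic

n_params = sum(p.numel() for p in model.parameters())
token_budget = chinchilla_tokens(n_params, multiple=3.0)   # "~3 Chinchillas"

opt_standard = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
opt_large_wd = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.5)
# Claimed behavior: at ~2.5-3x Chinchilla the weight_decay=0.5 run tends to come
# out ahead; with much larger token budgets (or other model sizes) this may flip.
```
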
Thinking Machines (@thinkymachines) 's Twitter Profile Photo

Today Thinking Machines Lab is launching our research blog, Connectionism. Our first blog post is “Defeating Nondeterminism in LLM Inference”

We believe that science is better when shared. Connectionism will cover topics as varied as our research is: from kernel numerics to
Angelika Romanou (@agromanou) 's Twitter Profile Photo

Proud to have been part of the team behind #Apertus 🌍✨ an open multilingual LLM. Trained on open data, supporting 1,800+ languages, and built with transparency, compliance & responsible AI in mind. 🤖 Try Apertus models: huggingface.co/collections/sw…

Aleksandr Dremov (@alexdremov_me) 's Twitter Profile Photo

Happy to share a paper we wrote at Apple — “Compute-Optimal Quantization-Aware Training”!

TLDR: Treat QAT as a first-class citizen and plan it in advance if you want to achieve the best quantized model with the compute you have. 

arxiv.org/abs/2509.22935

🧵🧵🧵
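
As a toy reading of "plan it in advance" (my own illustration; the budget and split below are placeholders, not numbers from the paper): reserve part of a fixed token budget for the QAT phase up front rather than bolting it on at the end.

```python
# Toy illustration of planning a QAT phase inside a fixed budget (placeholder numbers).
total_tokens = 300e9          # overall training budget (hypothetical)
qat_fraction = 0.10           # share reserved for QAT, decided up front (placeholder)

fp_tokens = total_tokens * (1 - qat_fraction)   # full-precision pretraining phase
qat_tokens = total_tokens * qat_fraction        # quantization-aware training phase
print(f"pretrain: {fp_tokens / 1e9:.0f}B tokens, QAT: {qat_tokens / 1e9:.0f}B tokens")
```
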
Alex Hägele (@haeggee) 's Twitter Profile Photo

In case you haven't seen, the updated report is finally on arXiv: arxiv.org/abs/2509.14233
And if you believe that such an initiative is cool: we are hiring research engineers! Come and work with the biggest public AI supercomputer --- 10k GPUs :)
Link: careers.epfl.ch/job/Lausanne-A…
George Grigorev (@iamgrigorev) 's Twitter Profile Photo

Most labs are leaving 40% performance on the table in FP8 training

They usually quantize only matmuls in MLPs and output projections – skipping the rest of the layers that unlock more gains.
The main issue is activation outliers that emerge late in training due to lack of
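
For context, a minimal sketch of the per-tensor scaling mechanics that FP8 recipes typically rely on (my own illustration, assuming a recent PyTorch with torch.float8_e4m3fn; it is not the recipe from this thread). The point it shows: the scale is tied to the tensor's absolute max, so a single late-emerging outlier moves it by orders of magnitude, which is what makes quantizing more of the layers delicate.

```python
import torch

E4M3_MAX = 448.0  # largest finite magnitude in float8_e4m3fn

def fp8_roundtrip(x: torch.Tensor):
    """Cast to float8_e4m3fn with per-tensor absmax scaling, then back to FP32."""
    scale = (x.abs().max() / E4M3_MAX).item()
    x_q = (x / scale).to(torch.float8_e4m3fn)
    return x_q.to(torch.float32) * scale, scale

torch.manual_seed(0)
acts = torch.randn(4096)                 # well-behaved activations
_, scale_clean = fp8_roundtrip(acts)

acts_outlier = acts.clone()
acts_outlier[0] = 1000.0                 # one outlier appearing late in training
_, scale_outlier = fp8_roundtrip(acts_outlier)

print(f"scale without outlier: {scale_clean:.4f}")
print(f"scale with outlier:    {scale_outlier:.4f} (~{scale_outlier / scale_clean:.0f}x larger)")
# If the scale comes from a lagged amax history (as delayed-scaling recipes use),
# values above scale * 448 at cast time fall outside e4m3's representable range
# until the scale catches up.
```
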
Justin Deschenaux (@jdeschena) 's Twitter Profile Photo

✨ Masked Generative Models (MGMs) are powerful and can generate tokens in parallel. They’ve driven impressive results across text and images and are increasingly competitive with autoregressive (AR) models. Thrilled to share our latest work to accelerate MGMs (1/12) 🧵

Justin Deschenaux (@jdeschena) 's Twitter Profile Photo

📢 « Partition Generative Modeling (PGM): Masked Modeling without Masks » is out!

🚯 Masked diffusion models waste FLOPs processing countless mask tokens that carry no real information.

⚡We show how partitioning can replace masking, boosting throughput by >5.3x on text and up
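
To make the FLOPs argument concrete, a rough back-of-the-envelope count of my own (not from the paper; sizes and mask rate are placeholders): with a fraction r of positions masked, a masked model still runs every layer over all L positions, whereas processing only the unmasked partition shrinks both the weight matmuls and the attention term.

```python
# Coarse per-layer transformer FLOPs: ~2 * 12*d^2 * L for the weight matmuls
# (QKV/output projections + a 4x MLP) plus ~4 * L^2 * d for attention.
def layer_flops(L: int, d: int) -> float:
    return 24 * L * d**2 + 4 * L**2 * d

L, d, r = 1024, 2048, 0.8                  # hypothetical sizes and mask rate
full = layer_flops(L, d)                   # masked model: every position processed
kept = layer_flops(int(L * (1 - r)), d)    # only the unmasked positions

print(f"compute ratio, all positions vs. unmasked only: {full / kept:.1f}x")
```

With these placeholder numbers the crude count lands around 5x, the same ballpark as the quoted throughput gain, though the real speedup depends on the method and the masking schedule.
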
Atli Kosson (@atlikosson) 's Twitter Profile Photo

The Maximal Update Parameterization (µP) allows LR transfer from small to large models, saving costly tuning. But why is independent weight decay (IWD) essential for it to work?

We find µP stabilizes early training (like an LR warmup), but IWD takes over in the long term! 🧵
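
For reference, a toy sketch of my own (not the paper's code; widths, learning rates, and the loss are placeholders) of the two ingredients: µP scales the Adam learning rate of hidden matrices roughly as 1/width so a LR tuned on a narrow model transfers to a wide one, while independent weight decay shrinks the weights by a fixed per-step factor that is not multiplied by that (now tiny) learning rate.

```python
import torch

# Placeholder widths and hyperparameters.
base_width, width = 128, 1024
base_lr, wd = 1e-2, 1e-4

# muP-style rule for Adam on hidden matrices: per-layer LR shrinks ~1/width
# relative to the base model, which is what lets the tuned LR transfer.
lr_hidden = base_lr * base_width / width

W = (torch.randn(width, width) / width**0.5).requires_grad_()
opt = torch.optim.Adam([W], lr=lr_hidden)    # no built-in decay; applied manually below

x = torch.randn(8, width)
for _ in range(3):
    opt.zero_grad()
    loss = (x @ W).pow(2).mean()             # stand-in loss
    loss.backward()
    opt.step()
    with torch.no_grad():
        # Independent weight decay: shrink by wd per step, NOT by lr_hidden * wd,
        # so the decay strength does not vanish as the muP learning rate shrinks.
        W.mul_(1.0 - wd)
```
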
Lucas Beyer (bl16) (@giffmana) 's Twitter Profile Photo

I have a thing for empirical deep dives into learning dynamics like the one in this paper. Sounds like muP mostly helps early training, while wd affects the long term.