Alex Hägele (@haeggee) 's Twitter Profile
Alex Hägele

@haeggee

Fellow @AnthropicAI + PhD Student in Machine Learning @ICepfl MLO. MSc/BSc from @ETH_en. Previously: Student Researcher @Apple MLR

ID: 1212418353094000640

Link: https://haeggee.github.io/ | Joined: 01-01-2020 17:00:47

271 Tweets

512 Followers

634 Following

Elliott Ash (@ellliottt) 's Twitter Profile Photo

Try out #Apertus, the fully open LLM (8B and 70B) developed by the Swiss AI Initiative (ETH, EPFL, CSCS).

Apertus isn’t just open-weights—it’s fully reproducible -- "Apertus" is Latin for "Open"! 

✅ Full code, data, checkpoints
✅ Respects robots.txt (retroactively) 
✅ 🐠
Paul Teiletche (@pteiletche) 's Twitter Profile Photo

Important release for the open-source community, check out the tech report! Glad to have been part of this great initiative🇨🇭

Matteo Pagliardini (@matpagliardini) 's Twitter Profile Photo

Replying to Lucas Beyer (bl16), Alex Hägele, EPFL, Andrei Semenov: I got excited about the results we got in the AdEMAMix paper (I'm the first author on that paper). Around that time, Andrei wanted to benchmark some of the recent optimizers. I naturally suggested AdEMAMix, as I was curious how it would compare to others.

xlr8harder (@xlr8harder) 's Twitter Profile Photo

Apertus is a bigger deal than a lot of people realize. It has fully open data, and I believe this is the largest open training run to date.

Skander Moalla (@skandermoalla) 's Twitter Profile Photo

A big step for Switzerland 🇨🇭 and a great achievement for our in-house alignment algorithm QRPO (x.com/skandermoalla/…) which has shown remarkable stability and predictability at the 70B scale 🚀!

Konstantin Mishchenko (@konstmish) 's Twitter Profile Photo

Two new papers from this week benchmark recent optimizers: Muon, Soap, Mars, ScheduleFree, Prodigy, AdEMAMix, Sophia, etc. The optimizers are also ablated across different settings (batch size, model scale, weight decay, scheduler).

A must read for anyone working on practical optimization.
Kaiyue Wen (@wen_kaiyue) 's Twitter Profile Photo

(1/n) Check out our new paper: "Fantastic Pretraining Optimizers and Where to Find Them"! >4000 models to find the fastest optimizer! 2× speedups over AdamW? Unlikely. Beware under-tuned baseline or limited scale! E.g. Muon: ~40% speedups <0.5B & only 10% at 1.2B (8× Chinchilla)!

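To put "8× Chinchilla" at 1.2B parameters into tokens, a quick back-of-the-envelope calculation (my own, using the common ~20 tokens-per-parameter Chinchilla heuristic, not a figure from the paper):

```python
# Back-of-the-envelope: what "1.2B parameters at 8x Chinchilla" means in tokens,
# assuming the usual ~20 training tokens per parameter for Chinchilla-optimal.
params = 1.2e9
tokens_per_param = 20          # common Chinchilla heuristic (assumption)
chinchilla_multiple = 8

tokens = params * tokens_per_param * chinchilla_multiple
print(f"~{tokens / 1e9:.0f}B training tokens")   # ~192B tokens
```
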
Ethan Perez (@ethanjperez) 's Twitter Profile Photo

We’re hiring someone to run the Anthropic Fellows Program! Our research collaborations have led to some of our best safety research and hires. We’re looking for an exceptional ops generalist, TPM, or research/eng manager to help us significantly scale and improve our collabs 🧵

Negar Foroutan ✈️ ICLR Singapore (@negarforoutan) 's Twitter Profile Photo

I’m glad to finally share the release of #Apertus, an open large language model developed together with colleagues at EPFL, ETH Zürich, and CSCS 🚀

Andrei Semenov (@andreisemenov17) 's Twitter Profile Photo

Optimizers. Finding 1.
We observed that a very large weight decay can be beneficial when training on a small number of tokens. Typically, runs with a large WD of 0.5 outperform those with the standard WD of 0.1 when training for ~2.5-3x the Chinchilla-optimal token count (this may change with model size).
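
Read concretely, the finding amounts to raising AdamW's weight_decay from 0.1 to 0.5 when the token budget is only a few Chinchilla multiples. A minimal sketch of my own (not the authors' benchmark code; model, learning rate, and sizes are placeholders):

```python
import torch

# Hypothetical setup illustrating the finding above; only weight_decay differs.
model = torch.nn.Linear(1024, 1024)          # stand-in for the real network

def chinchilla_tokens(n_params: float, multiple: float) -> float:
    return 20.0 * n_params * multiple        # ~20 tokens/param heuristic

n_params = sum(p.numel() for p in model.parameters())
token_budget = chinchilla_tokens(n_params, multiple=3.0)   # "~3 Chinchillas"

opt_standard = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
opt_large_wd = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.5)
# Claimed behavior: at ~2.5-3x Chinchilla the weight_decay=0.5 run tends to come
# out ahead; with much larger token budgets (or other model sizes) this may flip.
```
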
Thinking Machines (@thinkymachines) 's Twitter Profile Photo

Today Thinking Machines Lab is launching our research blog, Connectionism. Our first blog post is “Defeating Nondeterminism in LLM Inference”

We believe that science is better when shared. Connectionism will cover topics as varied as our research is: from kernel numerics to
Angelika Romanou (@agromanou) 's Twitter Profile Photo

Proud to have been part of the team behind #Apertus 🌍✨ an open multilingual LLM. Trained on open data, supporting 1,800+ languages, and built with transparency, compliance & responsible AI in mind. 🤖 Try Apertus models: huggingface.co/collections/sw…

Aleksandr Dremov (@alexdremov_me) 's Twitter Profile Photo

Happy to share a paper we wrote at Apple — “Compute-Optimal Quantization-Aware Training”!

TLDR: Treat QAT as a first-class citizen and plan it in advance if you want to achieve the best quantized model with the compute you have. 

arxiv.org/abs/2509.22935

🧵🧵🧵
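
As a toy reading of "plan it in advance" (my own illustration; the budget and split below are placeholders, not numbers from the paper): reserve part of a fixed token budget for the QAT phase up front rather than bolting it on at the end.

```python
# Toy illustration of planning a QAT phase inside a fixed budget (placeholder numbers).
total_tokens = 300e9          # overall training budget (hypothetical)
qat_fraction = 0.10           # share reserved for QAT, decided up front (placeholder)

fp_tokens = total_tokens * (1 - qat_fraction)   # full-precision pretraining phase
qat_tokens = total_tokens * qat_fraction        # quantization-aware training phase
print(f"pretrain: {fp_tokens / 1e9:.0f}B tokens, QAT: {qat_tokens / 1e9:.0f}B tokens")
```
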
Alex Hägele (@haeggee) 's Twitter Profile Photo

In case you haven't seen, the updated report is finally on arXiv: arxiv.org/abs/2509.14233
And if you believe that such an initiative is cool: we are hiring research engineers! Come and work with the biggest public AI supercomputer --- 10k GPUs :)
Link: careers.epfl.ch/job/Lausanne-A…
George Grigorev (@iamgrigorev) 's Twitter Profile Photo

Most labs are leaving 40% performance on the table in FP8 training

They usually quantize only matmuls in MLPs and output projections – skipping the rest of the layers that unlock more gains.
The main issue is activation outliers that emerge late in training due to lack of
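
For context, a minimal sketch of the per-tensor scaling mechanics that FP8 recipes typically rely on (my own illustration, assuming a recent PyTorch with torch.float8_e4m3fn; it is not the recipe from this thread). The point it shows: the scale is tied to the tensor's absolute max, so a single late-emerging outlier moves it by orders of magnitude, which is what makes quantizing more of the layers delicate.

```python
import torch

E4M3_MAX = 448.0  # largest finite magnitude in float8_e4m3fn

def fp8_roundtrip(x: torch.Tensor):
    """Cast to float8_e4m3fn with per-tensor absmax scaling, then back to FP32."""
    scale = (x.abs().max() / E4M3_MAX).item()
    x_q = (x / scale).to(torch.float8_e4m3fn)
    return x_q.to(torch.float32) * scale, scale

torch.manual_seed(0)
acts = torch.randn(4096)                 # well-behaved activations
_, scale_clean = fp8_roundtrip(acts)

acts_outlier = acts.clone()
acts_outlier[0] = 1000.0                 # one outlier appearing late in training
_, scale_outlier = fp8_roundtrip(acts_outlier)

print(f"scale without outlier: {scale_clean:.4f}")
print(f"scale with outlier:    {scale_outlier:.4f} (~{scale_outlier / scale_clean:.0f}x larger)")
# If the scale comes from a lagged amax history (as delayed-scaling recipes use),
# values above scale * 448 at cast time fall outside e4m3's representable range
# until the scale catches up.
```
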
Justin Deschenaux (@jdeschena) 's Twitter Profile Photo

✨ Masked Generative Models (MGMs) are powerful and can generate tokens in parallel. They’ve driven impressive results across text and images and are increasingly competitive with autoregressive (AR) models. Thrilled to share our latest work to accelerate MGMs (1/12) 🧵

Justin Deschenaux (@jdeschena) 's Twitter Profile Photo

📢 « Partition Generative Modeling (PGM): Masked Modeling without Masks » is out!

🚯 Masked diffusion models waste FLOPs processing countless mask tokens that carry no real information.

⚡We show how partitioning can replace masking, boosting throughput by >5.3x on text and up
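
To make the FLOPs argument concrete, a rough back-of-the-envelope count of my own (not from the paper; sizes and mask rate are placeholders): with a fraction r of positions masked, a masked model still runs every layer over all L positions, whereas processing only the unmasked partition shrinks both the weight matmuls and the attention term.

```python
# Coarse per-layer transformer FLOPs: ~2 * 12*d^2 * L for the weight matmuls
# (QKV/output projections + a 4x MLP) plus ~4 * L^2 * d for attention.
def layer_flops(L: int, d: int) -> float:
    return 24 * L * d**2 + 4 * L**2 * d

L, d, r = 1024, 2048, 0.8                  # hypothetical sizes and mask rate
full = layer_flops(L, d)                   # masked model: every position processed
kept = layer_flops(int(L * (1 - r)), d)    # only the unmasked positions

print(f"compute ratio, all positions vs. unmasked only: {full / kept:.1f}x")
```

With these placeholder numbers the crude count lands around 5x, the same ballpark as the quoted throughput gain, though the real speedup depends on the method and the masking schedule.
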
Atli Kosson (@atlikosson) 's Twitter Profile Photo

The Maximal Update Parameterization (µP) allows LR transfer from small to large models, saving costly tuning. But why is independent weight decay (IWD) essential for it to work?

We find µP stabilizes early training (like an LR warmup), but IWD takes over in the long term! 🧵
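
For reference, a toy sketch of my own (not the paper's code; widths, learning rates, and the loss are placeholders) of the two ingredients: µP scales the Adam learning rate of hidden matrices roughly as 1/width so a LR tuned on a narrow model transfers to a wide one, while independent weight decay shrinks the weights by a fixed per-step factor that is not multiplied by that (now tiny) learning rate.

```python
import torch

# Placeholder widths and hyperparameters.
base_width, width = 128, 1024
base_lr, wd = 1e-2, 1e-4

# muP-style rule for Adam on hidden matrices: per-layer LR shrinks ~1/width
# relative to the base model, which is what lets the tuned LR transfer.
lr_hidden = base_lr * base_width / width

W = (torch.randn(width, width) / width**0.5).requires_grad_()
opt = torch.optim.Adam([W], lr=lr_hidden)    # no built-in decay; applied manually below

x = torch.randn(8, width)
for _ in range(3):
    opt.zero_grad()
    loss = (x @ W).pow(2).mean()             # stand-in loss
    loss.backward()
    opt.step()
    with torch.no_grad():
        # Independent weight decay: shrink by wd per step, NOT by lr_hidden * wd,
        # so the decay strength does not vanish as the muP learning rate shrinks.
        W.mul_(1.0 - wd)
```
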
Lucas Beyer (bl16) (@giffmana) 's Twitter Profile Photo

I have a thing for empirical deep dives into learning dynamics like the one in this paper. Sounds like muP mostly helps early training, while wd affects the long term.