Keller Jordan (@kellerjordan0)'s Twitter Profile
Keller Jordan

@kellerjordan0

CIFAR-10 fanatic @OpenAI

ID: 712781250327490560

Joined: 23-03-2016 23:21:07

1.1K Tweets

9.9K Followers

331 Following

Keller Jordan (@kellerjordan0)

The reason I didn't write a proper arxiv paper for Muon is because I simply don't think there's any relationship between the ability to publish a paper with lots of good-looking results about a new optimizer, and whether that optimizer actually works. I only trust speedruns.

Keller Jordan (@kellerjordan0)

This is an exciting moment: The world's first report on successful large-scale training with a super-Adamic optimizer. Congratulations to the Kimi.ai team and to every Muon contributor: Yuchen Jin, Vlado Boza, You Jiacheng, leloy!, L. Newhouse, Jeremy Bernstein x.com/Kimi_Moonshot/…

Nat McAleese (@__nmca__)

large reasoning models are extremely good at reward hacking. A thread of examples from OpenAI's recent monitoring paper: (0/n)

Logan Engstrom (@logan_engstrom)

Want state-of-the-art data curation, data poisoning & more? Just do gradient descent!

w/ Andrew Ilyas, Ben Chen, Axel Feldmann, Billy Moses, Aleksander Madry: we show how to optimize final model loss wrt any continuous variable.

Key idea: Metagradients (grads through model training)
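
For intuition on "grads through model training": if every step of the inner training loop stays on the autograd graph, the final loss becomes a differentiable function of any continuous input to training (per-example data weights, hyperparameters, ...). A toy sketch of that idea with a tiny linear model and unrolled SGD, not the authors' implementation:

```python
# Toy metagradient: differentiate the final (held-out) loss of an unrolled
# training run with respect to per-example data weights.
import torch

torch.manual_seed(0)
X, y = torch.randn(32, 5), torch.randn(32)          # toy training set
X_val, y_val = torch.randn(8, 5), torch.randn(8)    # held-out set

# The continuous variable we want a metagradient for.
data_weights = torch.ones(32, requires_grad=True)

w = torch.zeros(5, requires_grad=True)              # model parameters
lr = 0.1
for _ in range(20):                                 # unrolled inner training loop
    per_example = (X @ w - y) ** 2
    train_loss = (data_weights * per_example).mean()
    (g,) = torch.autograd.grad(train_loss, w, create_graph=True)
    w = w - lr * g                                  # functional SGD step, stays on the graph

final_loss = ((X_val @ w - y_val) ** 2).mean()
# Metagradient: d(final loss)/d(data_weights), backpropagated through training.
(metagrad,) = torch.autograd.grad(final_loss, data_weights)
print(metagrad.shape)                               # torch.Size([32])
```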
You Jiacheng (@youjiacheng)

GPT-2 Medium speedrun new record candidate: 6710 steps (estimated time: ~26.1 minutes)
previous record: 6950 steps (27.2 minutes)
reproducible log: gist.github.com/YouJiacheng/6f…
with tuning enabled, the run was timed at 25.95 minutes
You Jiacheng (@youjiacheng)

GPT Medium new record: ~1525s (25.43 minutes)
New sharded mixed precision Muon implementation in <100 lines: faster and only uses 4 + 6/DP bytes per param.
Previous implementation uses 10 + 4/DP bytes, ZeRO-1 AdamW uses 4 + 12/DP bytes.
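
To make those figures concrete: reading "a + b/DP bytes per param" as `a` bytes of replicated state plus `b` bytes sharded across the DP data-parallel ranks (my reading of the tweet, not an official breakdown), the numbers work out as follows.

```python
# Worked example of the per-parameter memory figures quoted above.
def bytes_per_param(replicated: float, sharded: float, dp: int) -> float:
    return replicated + sharded / dp

for dp in (1, 2, 4, 8):
    new_muon = bytes_per_param(4, 6, dp)      # new sharded mixed-precision Muon
    old_muon = bytes_per_param(10, 4, dp)     # previous Muon implementation
    zero1_adamw = bytes_per_param(4, 12, dp)  # ZeRO-1 AdamW
    print(f"DP={dp}: Muon(new)={new_muon:.2f}  Muon(old)={old_muon:.2f}  "
          f"AdamW(ZeRO-1)={zero1_adamw:.2f} bytes/param")
# At DP=8: 4.75 vs 10.50 vs 5.50 bytes per parameter.
```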
Jaden Johnson (@jadenj3o)

New NanoGPT-Medium speedrun record: 25.35 min -> 24.84 min
Changes:
Changed sliding window size scaling from linear to cubic (4x^3 - 6x^2 + 3x)
Increased max sliding window size from 1728 -> 3456
Cooldown Fraction 0.6 -> 0.7
Iterations 6450 -> 5960
Reproducible log:
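
A sketch of what the cubic window schedule above looks like, assuming the window is the max size scaled by 4x^3 - 6x^2 + 3x of training progress; the rounding details are assumptions, not the record's exact code.

```python
# Cubic sliding-window schedule: grows from ~0 to the max window over training,
# following 4x^3 - 6x^2 + 3x instead of a straight line.
MAX_WINDOW = 3456  # new max sliding window size (was 1728)

def window_size(step: int, total_steps: int, max_window: int = MAX_WINDOW) -> int:
    x = step / total_steps                  # training progress in [0, 1]
    frac = 4 * x**3 - 6 * x**2 + 3 * x      # 0 at x=0, 0.5 at x=0.5, 1 at x=1
    return max(1, round(frac * max_window))

for step in (0, 1490, 2980, 4470, 5960):
    print(step, window_size(step, total_steps=5960))
```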

Ashish Vaswani (@ashvaswani)

Please check out our thorough study on the advantages of Muon. Second-order optimization is a promising path to more efficient LLM pretraining.
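
For context on the optimizer being studied: a condensed single-matrix sketch of the Muon update as described in Keller Jordan's public write-up, i.e. momentum followed by approximate orthogonalization of the update via a Newton-Schulz iteration. Simplified here (no distributed sharding or Nesterov option; the real code runs the iteration in bfloat16), so treat it as a sketch rather than the reference implementation.

```python
import torch

def newton_schulz5(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize G (push its singular values toward 1)."""
    a, b, c = 3.4445, -4.7750, 2.0315         # quintic coefficients from the Muon post
    X = G / (G.norm() + 1e-7)
    transposed = X.size(0) > X.size(1)
    if transposed:                             # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    return X.T if transposed else X

@torch.no_grad()
def muon_step(W: torch.Tensor, grad: torch.Tensor, buf: torch.Tensor,
              lr: float = 0.02, momentum: float = 0.95) -> None:
    buf.mul_(momentum).add_(grad)              # SGD-style momentum buffer
    update = newton_schulz5(buf)               # orthogonalized update direction
    scale = max(1.0, W.size(0) / W.size(1)) ** 0.5
    W.add_(update, alpha=-lr * scale)
```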

Keller Jordan (@kellerjordan0)

New NanoGPT training speed record: 3.28 FineWeb val loss in 2.990 minutes on 8xH100

Previous record: 3.014 minutes (1.44s slower)
Changelog: Accelerated gradient all-reduce

New record-holders: Konstantin Willeke et al. of The Enigma project
Konstantin Willeke (@konstantinwille)

New NanoGPT training speed world record from the Enigma Project 🎉 (Andreas Tolias Lab @ Stanford University, Sophia Sanborn, enigmaproject.ai) We improve the efficiency of gradient all_reduce. Short explainer of our method 👇 [1/6]

Keller Jordan (@kellerjordan0)

New NanoGPT training speed record: 3.28 FineWeb val loss in 2.979 minutes on 8xH100

Previous record: 2.990 minutes (0.7s slower)
Changelog: Overlapped gradient communication with computation

New record-holder: Ryan Yang-Liu
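
The "overlapped gradient communication" idea in generic PyTorch terms: start an async all-reduce for each gradient as soon as backward produces it, then wait on the outstanding handles before the optimizer step. This is an illustrative pattern, not the speedrun's actual implementation (which has its own bucketing and sharding).

```python
# Hypothetical sketch: overlap gradient all-reduce with backward computation.
import torch
import torch.distributed as dist

def attach_overlapped_allreduce(model: torch.nn.Module):
    handles = []

    def make_hook(param: torch.nn.Parameter):
        def hook(*_):
            # This parameter's gradient is ready: start reducing it while the
            # rest of backward keeps running.
            param.grad /= dist.get_world_size()
            handles.append(dist.all_reduce(param.grad, async_op=True))
        return hook

    for p in model.parameters():
        if p.requires_grad:
            p.register_post_accumulate_grad_hook(make_hook(p))

    def wait_all():
        # Call after loss.backward() and before optimizer.step().
        for h in handles:
            h.wait()
        handles.clear()

    return wait_all
```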
Keller Jordan (@kellerjordan0)

There have been hundreds of optimizer papers published. But the SOTA has only improved a few times. Therefore we can conclude that almost all optimizer papers are fake. If you're gonna write another fake optimizer paper, please don't cite Muon. I don't want your citation

Andrew Ilyas (@andrew_ilyas)

“How will my model behave if I change the training data?”

Recent(-ish) work w/ Logan Engstrom: we nearly *perfectly* predict ML model behavior as a function of training data, saturating benchmarks for this problem (called “data attribution”).
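
The "data attribution" task being saturated here: predict a trained model's behavior on a target example directly from which training examples were included. A toy linear-surrogate sketch in that spirit, on synthetic stand-in data; this is not the authors' method or their benchmarks.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_runs = 200, 500

# Hypothetical data you would collect by retraining many times: masks[i] says
# which training examples run i used; outputs[i] is that trained model's
# measurement (e.g. loss or margin) on one fixed target example.
masks = rng.integers(0, 2, size=(n_runs, n_train)).astype(float)
true_influence = rng.normal(size=n_train) / n_train
outputs = masks @ true_influence + 0.01 * rng.normal(size=n_runs)  # stand-in

# Fit the linear surrogate: outputs ≈ masks @ influence + bias.
A = np.hstack([masks, np.ones((n_runs, 1))])
theta, *_ = np.linalg.lstsq(A, outputs, rcond=None)
influence, bias = theta[:-1], theta[-1]

# influence[j] estimates how including training example j shifts the prediction.
predicted = masks @ influence + bias
print("R^2:", 1 - np.var(outputs - predicted) / np.var(outputs))
```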
Minqi Jiang (@minqijiang)

Recently, there has been a lot of talk of LLM agents automating ML research itself. If Llama 5 can create Llama 6, then surely the singularity is just around the corner. How can we get a pulse check on whether current LLMs are capable of driving this kind of total

Recently, there has been a lot of talk of LLM agents automating ML research itself. If Llama 5 can create Llama 6, then surely the singularity is just around the corner. 

How can we get a pulse check on whether current LLMs are capable of driving this kind of total
Andrej Karpathy (@karpathy)

Love this project: nanoGPT -> recursive self-improvement benchmark. Good old nanoGPT keeps on giving and surprising :)
- First I wrote it as a small little repo to teach people the basics of training GPTs.
- Then it became a target and baseline for my port to direct C/CUDA