
Keller Jordan
@kellerjordan0
CIFAR-10 fanatic @OpenAI
ID: 712781250327490560
23-03-2016 23:21:07
1.1K Tweets
9.9K Followers
331 Following


This is an exciting moment: The world's first report on successful large-scale training with a super-Adamic optimizer. Congratulations to the Kimi.ai team and to every Muon contributor: Yuchen Jin, Vlado Boza, You Jiacheng, leloy!, L. Newhouse, Jeremy Bernstein x.com/Kimi_Moonshot/…


Want state-of-the-art data curation, data poisoning & more? Just do gradient descent! w/ Andrew Ilyas, Ben Chen, Axel Feldmann, Billy Moses, Aleksander Madry: we show how to optimize final model loss wrt any continuous variable. Key idea: Metagradients (grads through model training)
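A minimal sketch of the metagradient idea, under assumptions not taken from the paper (a toy linear model, with per-example loss weights standing in for "any continuous variable"): unroll a few differentiable training steps, then backpropagate the final validation loss through the whole run.

```python
# Illustrative sketch of a metagradient: the gradient of the final loss taken
# *through* an unrolled training run, w.r.t. a continuous data variable
# (here, a weight on each training example). Toy setup, not the paper's code.
import torch

def unrolled_train(data_weights, x_tr, y_tr, x_val, y_val, steps=20, lr=0.1):
    params = torch.zeros(x_tr.shape[1], requires_grad=True)
    for _ in range(steps):
        per_example = (x_tr @ params - y_tr) ** 2
        train_loss = (data_weights * per_example).mean()
        # create_graph=True keeps the update itself differentiable
        (g,) = torch.autograd.grad(train_loss, params, create_graph=True)
        params = params - lr * g  # out-of-place update preserves the graph
    return ((x_val @ params - y_val) ** 2).mean()

torch.manual_seed(0)
x_tr, y_tr = torch.randn(64, 8), torch.randn(64)
x_val, y_val = torch.randn(32, 8), torch.randn(32)

data_weights = torch.ones(64, requires_grad=True)
val_loss = unrolled_train(data_weights, x_tr, y_tr, x_val, y_val)
(metagrad,) = torch.autograd.grad(val_loss, data_weights)
# metagrad[i] estimates how up/down-weighting training example i changes the
# final val loss; descending on it is the "just do gradient descent" step
# for curation (lower the final loss) or poisoning (raise it).
```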



Congratulations You Jiacheng on this new speedrun record! It is an interesting one. x.com/YouJiacheng/st…




New NanoGPT training speed record: 3.28 FineWeb val loss in 2.990 minutes on 8xH100
Previous record: 3.014 minutes (1.44s slower)
Changelog: Accelerated gradient all-reduce
New record-holders: Konstantin Willeke et al. of The Enigma project


New NanoGPT training speed world record from the Enigma Project 🎉 (Andreas Tolias Lab @ Stanford University, Sophia Sanborn, enigmaproject.ai) We improve the efficiency of gradient all_reduce. Short explainer of our method 👇 [1/6]
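Their thread explains the actual method; purely as an illustration of the kind of change involved, here is one generic way to make gradient all-reduce cheaper. The function name and the bf16 choice are assumptions for the sketch, not the Enigma recipe.

```python
# Generic illustration only (the Enigma thread describes the real method):
# pack every gradient into one contiguous bf16 buffer and run a single
# all_reduce, instead of many small per-tensor collectives.
import torch
import torch.distributed as dist

def allreduce_gradients(model: torch.nn.Module):
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    # One big collective uses NCCL bandwidth far better than many tiny ones,
    # and bf16 halves the bytes on the wire.
    flat = torch.cat([g.reshape(-1).to(torch.bfloat16) for g in grads])
    dist.all_reduce(flat, op=dist.ReduceOp.AVG)
    offset = 0
    for g in grads:
        n = g.numel()
        g.copy_(flat[offset:offset + n].view_as(g).to(g.dtype))
        offset += n
```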

New NanoGPT training speed record: 3.28 FineWeb val loss in 2.979 minutes on 8xH100
Previous record: 2.990 minutes (0.7s slower)
Changelog: Overlapped gradient communication with computation
New record-holder: Ryan Yang-Liu
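A hedged sketch of the general technique named in the changelog (not necessarily this record's exact code): launch an asynchronous all_reduce for each gradient as soon as backward produces it, so communication for early-finishing layers overlaps with compute for the rest, and only block right before the optimizer step.

```python
# Sketch of overlapping gradient communication with backward computation
# (generic pattern, not the record's implementation). Assumes PyTorch >= 2.1
# for register_post_accumulate_grad_hook and an initialized process group.
import torch
import torch.distributed as dist

pending_work = []

def install_overlap_hooks(model: torch.nn.Module):
    for p in model.parameters():
        if p.requires_grad:
            def hook(param):
                # Fire the collective the moment this grad is ready; it runs
                # on the communication stream while backward keeps computing.
                work = dist.all_reduce(param.grad, op=dist.ReduceOp.AVG, async_op=True)
                pending_work.append(work)
            p.register_post_accumulate_grad_hook(hook)

def finish_gradient_sync():
    # Only block right before optimizer.step().
    for work in pending_work:
        work.wait()
    pending_work.clear()

# Usage per step (after dist.init_process_group and install_overlap_hooks(model)):
#   loss.backward(); finish_gradient_sync(); optimizer.step(); optimizer.zero_grad()
```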



“How will my model behave if I change the training data?” Recent(-ish) work w/ Logan Engstrom: we nearly *perfectly* predict ML model behavior as a function of training data, saturating benchmarks for this problem (called “data attribution”).
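For context on the problem setup only (a generic illustration, not the tweeted method): data attribution asks for per-training-example scores that predict a model's behavior on a target example from which training data it saw. A simple baseline fits a linear "datamodel" to the outputs of models retrained on random subsets; everything below is synthetic placeholder data.

```python
# Toy illustration of the data-attribution setup (not the tweeted method):
# fit a linear "datamodel" predicting a model's output on one target example
# from the 0/1 mask of which training examples were included.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_train, n_models = 1000, 500
masks = (rng.random((n_models, n_train)) < 0.5).astype(float)  # subset used by each retrained model
outputs = rng.normal(size=n_models)  # placeholder: measured output (e.g. margin) per retrained model

datamodel = Lasso(alpha=0.01).fit(masks, outputs)
scores = datamodel.coef_  # scores[i]: estimated effect of training example i on the target output
```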



