Keller Jordan (@kellerjordan0)'s Twitter Profile
Keller Jordan

@kellerjordan0

CIFAR-10 fanatic @OpenAI

ID: 712781250327490560

Joined: 23-03-2016 23:21:07

1.1K Tweets

9.9K Followers

331 Following

Keller Jordan (@kellerjordan0)

The reason I didn't write a proper arxiv paper for Muon is because I simply don't think there's any relationship between the ability to publish a paper with lots of good-looking results about a new optimizer, and whether that optimizer actually works. I only trust speedruns.

Keller Jordan (@kellerjordan0)

This is an exciting moment: The world's first report on successful large-scale training with a super-Adamic optimizer. Congratulations to the Kimi.ai team and to every Muon contributor: Yuchen Jin, Vlado Boza, You Jiacheng, leloy!, L. Newhouse, Jeremy Bernstein x.com/Kimi_Moonshot/…

Nat McAleese (@__nmca__)

large reasoning models are extremely good at reward hacking. A thread of examples from OpenAI's recent monitoring paper: (0/n)

Logan Engstrom (@logan_engstrom)

Want state-of-the-art data curation, data poisoning & more? Just do gradient descent!

w/ Andrew Ilyas, Ben Chen, Axel Feldmann, Billy Moses, Aleksander Madry: we show how to optimize final model loss wrt any continuous variable.

Key idea: Metagradients (grads through model training)
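
For intuition on "grads through model training": if every step of the inner training loop stays on the autograd graph, the final loss becomes a differentiable function of any continuous input to training (per-example data weights, hyperparameters, ...). A toy sketch of that idea with a tiny linear model and unrolled SGD, not the authors' implementation:

```python
# Toy metagradient: differentiate the final (held-out) loss of an unrolled
# training run with respect to per-example data weights.
import torch

torch.manual_seed(0)
X, y = torch.randn(32, 5), torch.randn(32)          # toy training set
X_val, y_val = torch.randn(8, 5), torch.randn(8)    # held-out set

# The continuous variable we want a metagradient for.
data_weights = torch.ones(32, requires_grad=True)

w = torch.zeros(5, requires_grad=True)              # model parameters
lr = 0.1
for _ in range(20):                                 # unrolled inner training loop
    per_example = (X @ w - y) ** 2
    train_loss = (data_weights * per_example).mean()
    (g,) = torch.autograd.grad(train_loss, w, create_graph=True)
    w = w - lr * g                                  # functional SGD step, stays on the graph

final_loss = ((X_val @ w - y_val) ** 2).mean()
# Metagradient: d(final loss)/d(data_weights), backpropagated through training.
(metagrad,) = torch.autograd.grad(final_loss, data_weights)
print(metagrad.shape)                               # torch.Size([32])
```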
You Jiacheng (@youjiacheng)

GPT-2 Medium speedrun new record candidate: 6710 steps (estimated time: ~26.1 minutes)
previous record: 6950 steps (27.2 minutes)
reproducible log: gist.github.com/YouJiacheng/6f…
with tuning enabled, the run was timed at 25.95 minutes
You Jiacheng (@youjiacheng)

GPT Medium new record: ~1525s (25.43 minutes)
New sharded mixed precision Muon implementation in <100 lines: faster and only uses 4 + 6/DP bytes per param.
Previous implementation uses 10 + 4/DP bytes, ZeRO-1 AdamW uses 4 + 12/DP bytes.
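
To make those figures concrete: reading "a + b/DP bytes per param" as `a` bytes of replicated state plus `b` bytes sharded across the DP data-parallel ranks (my reading of the tweet, not an official breakdown), the numbers work out as follows.

```python
# Worked example of the per-parameter memory figures quoted above.
def bytes_per_param(replicated: float, sharded: float, dp: int) -> float:
    return replicated + sharded / dp

for dp in (1, 2, 4, 8):
    new_muon = bytes_per_param(4, 6, dp)      # new sharded mixed-precision Muon
    old_muon = bytes_per_param(10, 4, dp)     # previous Muon implementation
    zero1_adamw = bytes_per_param(4, 12, dp)  # ZeRO-1 AdamW
    print(f"DP={dp}: Muon(new)={new_muon:.2f}  Muon(old)={old_muon:.2f}  "
          f"AdamW(ZeRO-1)={zero1_adamw:.2f} bytes/param")
# At DP=8: 4.75 vs 10.50 vs 5.50 bytes per parameter.
```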
Jaden Johnson (@jadenj3o)

New NanoGPT-Medium speedrun record: 25.35 min -> 24.84 min
Changes:
Changed sliding window size scaling from linear to cubic (4x^3 - 6x^2 + 3x)
Increased max sliding window size from 1728 -> 3456
Cooldown Fraction 0.6 -> 0.7
Iterations 6450 -> 5960
Reproducible log:
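
A sketch of what the cubic window schedule above looks like, assuming the window is the max size scaled by 4x^3 - 6x^2 + 3x of training progress; the rounding details are assumptions, not the record's exact code.

```python
# Cubic sliding-window schedule: grows from ~0 to the max window over training,
# following 4x^3 - 6x^2 + 3x instead of a straight line.
MAX_WINDOW = 3456  # new max sliding window size (was 1728)

def window_size(step: int, total_steps: int, max_window: int = MAX_WINDOW) -> int:
    x = step / total_steps                  # training progress in [0, 1]
    frac = 4 * x**3 - 6 * x**2 + 3 * x      # 0 at x=0, 0.5 at x=0.5, 1 at x=1
    return max(1, round(frac * max_window))

for step in (0, 1490, 2980, 4470, 5960):
    print(step, window_size(step, total_steps=5960))
```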

Ashish Vaswani (@ashvaswani)

Please check out our thorough study on the advantages of Muon. Second-order optimization is a promising path to more efficient LLM pretraining.
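
For context on the optimizer being studied: a condensed single-matrix sketch of the Muon update as described in Keller Jordan's public write-up, i.e. momentum followed by approximate orthogonalization of the update via a Newton-Schulz iteration. Simplified here (no distributed sharding or Nesterov option; the real code runs the iteration in bfloat16), so treat it as a sketch rather than the reference implementation.

```python
import torch

def newton_schulz5(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize G (push its singular values toward 1)."""
    a, b, c = 3.4445, -4.7750, 2.0315         # quintic coefficients from the Muon post
    X = G / (G.norm() + 1e-7)
    transposed = X.size(0) > X.size(1)
    if transposed:                             # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    return X.T if transposed else X

@torch.no_grad()
def muon_step(W: torch.Tensor, grad: torch.Tensor, buf: torch.Tensor,
              lr: float = 0.02, momentum: float = 0.95) -> None:
    buf.mul_(momentum).add_(grad)              # SGD-style momentum buffer
    update = newton_schulz5(buf)               # orthogonalized update direction
    scale = max(1.0, W.size(0) / W.size(1)) ** 0.5
    W.add_(update, alpha=-lr * scale)
```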

Keller Jordan (@kellerjordan0)

New NanoGPT training speed record: 3.28 FineWeb val loss in 2.990 minutes on 8xH100

Previous record: 3.014 minutes (1.44s slower)
Changelog: Accelerated gradient all-reduce

New record-holders: Konstantin Willeke et al. of The Enigma project
Konstantin Willeke (@konstantinwille)

New NanoGPT training speed world record from the Enigma Project 🎉 (Andreas Tolias Lab @ Stanford University, Sophia Sanborn, enigmaproject.ai) We improve the efficiency of gradient all_reduce. Short explainer of our method 👇 [1/6]

Keller Jordan (@kellerjordan0)

New NanoGPT training speed record: 3.28 FineWeb val loss in 2.979 minutes on 8xH100

Previous record: 2.990 minutes (0.7s slower)
Changelog: Overlapped gradient communication with computation

New record-holder: Ryan Yang-Liu
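
The "overlapped gradient communication" idea in generic PyTorch terms: start an async all-reduce for each gradient as soon as backward produces it, then wait on the outstanding handles before the optimizer step. This is an illustrative pattern, not the speedrun's actual implementation (which has its own bucketing and sharding).

```python
# Hypothetical sketch: overlap gradient all-reduce with backward computation.
import torch
import torch.distributed as dist

def attach_overlapped_allreduce(model: torch.nn.Module):
    handles = []

    def make_hook(param: torch.nn.Parameter):
        def hook(*_):
            # This parameter's gradient is ready: start reducing it while the
            # rest of backward keeps running.
            param.grad /= dist.get_world_size()
            handles.append(dist.all_reduce(param.grad, async_op=True))
        return hook

    for p in model.parameters():
        if p.requires_grad:
            p.register_post_accumulate_grad_hook(make_hook(p))

    def wait_all():
        # Call after loss.backward() and before optimizer.step().
        for h in handles:
            h.wait()
        handles.clear()

    return wait_all
```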
Keller Jordan (@kellerjordan0)

There have been hundreds of optimizer papers published. But the SOTA has only improved a few times. Therefore we can conclude that almost all optimizer papers are fake. If you're gonna write another fake optimizer paper, please don't cite Muon. I don't want your citation

Andrew Ilyas (@andrew_ilyas)

“How will my model behave if I change the training data?”

Recent(-ish) work w/ Logan Engstrom: we nearly *perfectly* predict ML model behavior as a function of training data, saturating benchmarks for this problem (called “data attribution”).
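
The "data attribution" task being saturated here: predict a trained model's behavior on a target example directly from which training examples were included. A toy linear-surrogate sketch in that spirit, on synthetic stand-in data; this is not the authors' method or their benchmarks.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_runs = 200, 500

# Hypothetical data you would collect by retraining many times: masks[i] says
# which training examples run i used; outputs[i] is that trained model's
# measurement (e.g. loss or margin) on one fixed target example.
masks = rng.integers(0, 2, size=(n_runs, n_train)).astype(float)
true_influence = rng.normal(size=n_train) / n_train
outputs = masks @ true_influence + 0.01 * rng.normal(size=n_runs)  # stand-in

# Fit the linear surrogate: outputs ≈ masks @ influence + bias.
A = np.hstack([masks, np.ones((n_runs, 1))])
theta, *_ = np.linalg.lstsq(A, outputs, rcond=None)
influence, bias = theta[:-1], theta[-1]

# influence[j] estimates how including training example j shifts the prediction.
predicted = masks @ influence + bias
print("R^2:", 1 - np.var(outputs - predicted) / np.var(outputs))
```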
Minqi Jiang (@minqijiang)

Recently, there has been a lot of talk of LLM agents automating ML research itself. If Llama 5 can create Llama 6, then surely the singularity is just around the corner. How can we get a pulse check on whether current LLMs are capable of driving this kind of total

Recently, there has been a lot of talk of LLM agents automating ML research itself. If Llama 5 can create Llama 6, then surely the singularity is just around the corner. 

How can we get a pulse check on whether current LLMs are capable of driving this kind of total
Andrej Karpathy (@karpathy)

Love this project: nanoGPT -> recursive self-improvement benchmark. Good old nanoGPT keeps on giving and surprising :)
- First I wrote it as a small little repo to teach people the basics of training GPTs.
- Then it became a target and baseline for my port to direct C/CUDA