Kimbo Chen (@kimbochen)'s Twitter Profile

Kimbo Chen
@kimbochen

High-performance ML algorithms, compilers, and systems

ID: 2870711864

Link: https://github.com/kimbochen/md-blogs
Joined: 22-10-2014 10:53:31

1.1K Tweets
381 Followers
583 Following

stochasm (@stochasticchasm) 's Twitter Profile Photo

This paper answers a long-standing question I had and claims that decay + merging does not outperform merging alone, which simplifies things quite nicely

Adam Zweiger (@adamzweiger) 's Twitter Profile Photo

Here are all the architecture tricks used by gpt-oss:
- Attention sinks: for each attention head, have a learned scalar such that softmax(qk) becomes a softmax over [a_1, a_2, ..., a_T, sink]. Tokens don't have to attend to anything if all the attention scores are low!
-
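A minimal sketch of the sink mechanism described above (tensor names and shapes are illustrative, not gpt-oss's actual code): append one learned logit per head before the softmax, then drop the sink column so a row's attention weights can sum to less than 1.

```python
import torch

def attention_with_sink(q, k, v, sink_logit):
    # q, k, v: [B, H, T, D]; sink_logit: [H], one learned scalar per head.
    # Causal masking is omitted for brevity.
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5               # [B, H, T, T]
    sink = sink_logit.view(1, -1, 1, 1).expand(*scores.shape[:-1], 1)   # [B, H, T, 1]
    probs = torch.softmax(torch.cat([scores, sink], dim=-1), dim=-1)
    probs = probs[..., :-1]  # drop the sink column: rows now sum to <= 1,
                             # so a token can effectively attend to "nothing"
    return probs @ v
```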

Xiangming Gu @ ICLR 2025 (@gu_xiangming) 's Twitter Profile Photo

I noticed that OpenAI added a learnable bias to the attention logits before the softmax, then dropped it after the softmax. This is similar to what I did in my ICLR 2025 paper: openreview.net/forum?id=78Nn4… I used a learnable key bias and set the corresponding value bias to zero. In this way,
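A minimal sketch of the key-bias variant described here (names are mine, not the paper's): append one extra learnable key per head paired with an all-zero value, which gives the softmax a sink slot that absorbs probability mass but contributes nothing to the output.

```python
import torch

def attention_with_key_bias(q, k, v, k_bias):
    # q, k, v: [B, H, T, D]; k_bias: [H, D], learnable, paired with a zero value.
    B, H, T, D = k.shape
    k_ext = torch.cat([k, k_bias.view(1, H, 1, D).expand(B, H, 1, D)], dim=2)              # [B, H, T+1, D]
    v_ext = torch.cat([v, torch.zeros(B, H, 1, D, dtype=v.dtype, device=v.device)], dim=2)
    probs = torch.softmax(q @ k_ext.transpose(-2, -1) / D ** 0.5, dim=-1)                  # [B, H, T, T+1]
    return probs @ v_ext  # the extra slot soaks up attention without adding anything
```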
Feng Yao (@fengyao1909) 's Twitter Profile Photo

Failing on large-scale RL with VeRL?

⚠️ Mixing an inference backend (vLLM/SGLang) with a training backend (FSDP/Megatron) secretly turns your RL into off-policy RL, even if they share the same weights!

📉 Blog:
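One way to see the mismatch in practice (a diagnostic sketch, not VeRL's API; the two log-prob tensors are placeholders for values computed by the training and inference engines on the same sampled tokens): even with identical weights, kernel and numerics differences make the two per-token distributions disagree, so the sampler acts as a slightly different behavior policy.

```python
import torch

def offpolicy_gap(trainer_logprobs: torch.Tensor, sampler_logprobs: torch.Tensor) -> dict:
    # Both tensors are [num_tokens]: log p(token) for the *same* sampled tokens,
    # one computed by the training backend, one by the inference backend.
    diff = trainer_logprobs - sampler_logprobs
    ratio = diff.exp()  # per-token importance ratio pi_train / pi_infer
    return {
        "mean_abs_logprob_diff": diff.abs().mean().item(),
        "max_ratio": ratio.max().item(),  # far from 1.0 means effectively off-policy tokens
        "min_ratio": ratio.min().item(),
    }
```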
Jonathan Chang (@cccntu) 's Twitter Profile Photo

While we wait for GPT-5 to drop, here is a flex attention tutorial for building a <1000 LoC vLLM from scratch: jonathanc.net/blog/vllm-flex…
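For context, a minimal FlexAttention usage sketch (my own toy example, not taken from the linked tutorial), using PyTorch's torch.nn.attention.flex_attention with a causal block mask:

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx  # mask_mod: which (query, key) positions may attend

B, H, S, D = 1, 8, 1024, 64
q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) for _ in range(3))

# The block mask lets FlexAttention skip fully-masked tiles entirely (requires a GPU).
block_mask = create_block_mask(causal, B=None, H=None, Q_LEN=S, KV_LEN=S)
out = flex_attention(q, k, v, block_mask=block_mask)
```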

Guangxuan Xiao (@guangxuan_xiao) 's Twitter Profile Photo

I've written the full story of Attention Sinks: a technical deep-dive into how the mechanism was developed and how our research ended up being used in OpenAI's new OSS models.

For those interested in the details:
hanlab.mit.edu/blog/streaming…
Jinjie Ni @ ICLR'25 🇸🇬 (@nijinjie) 's Twitter Profile Photo

Token crisis: solved. ✅

We pre-trained diffusion language models (DLMs) vs. autoregressive (AR) models from scratch, up to 8B params, 480B tokens, 480 epochs.

Findings:
> DLMs beat AR when tokens are limited, with >3× data potential.
> A 1B DLM trained on just 1B tokens
wh (@nrehiew_) 's Twitter Profile Photo

Let's talk about the GLM 4.5 models.

The latest frontier open weights model out of China (and possibly the best at the moment?) with quite a bit of details in the paper.
Dimitris Papailiopoulos (@dimitrispapail) 's Twitter Profile Photo

Another interesting observation: performing SGD on the cross-entropy loss over your text corpus is equivalent to REINFORCE, i.e., on-policy policy gradient, with the binary reward "did my model generate text from the corpus?"
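One way to see the claimed correspondence (my own sketch of the argument, with a per-sequence reward r(x) = 1 iff x is corpus text):

```latex
% REINFORCE update for a sampled sequence x with binary reward r(x) = \mathbf{1}[x \in \text{corpus}]:
r(x)\,\nabla_\theta \log \pi_\theta(x) =
\begin{cases}
  \nabla_\theta \log \pi_\theta(x) & x \in \text{corpus} \\
  0 & \text{otherwise,}
\end{cases}
% which on corpus text is exactly the (negative) cross-entropy / MLE gradient
% and a no-op everywhere else:
-\nabla_\theta \mathcal{L}_{\mathrm{CE}}(x) = \nabla_\theta \log \pi_\theta(x), \qquad x \in \text{corpus}.
```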

Feng Yao (@fengyao1909) 's Twitter Profile Photo

Liyuan Liu (Lucas) Chengyu Dong Dinghuai Zhang 张鼎怀 Jingbo Shang Jianfeng Gao (2/4) What's the secret sauce?

We build on our previous truncated importance sampling (TIS) blog (fengyao.notion.site/off-policy-rl) to address this issue. Here's a quick summary of how it works.
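A minimal sketch of truncated importance sampling as I understand it from the linked blog (function and variable names are mine, not the authors'): reweight the policy-gradient loss by the per-token ratio between the training policy and the inference-engine policy that actually generated the tokens, truncated from above to keep variance bounded.

```python
import torch

def tis_policy_loss(trainer_logprobs, sampler_logprobs, advantages, clip_c: float = 2.0):
    # trainer_logprobs: log pi_train(token) from the training backend (FSDP/Megatron)
    # sampler_logprobs: log pi_infer(token) from the inference backend (vLLM/SGLang)
    # advantages:       per-token advantage estimates; all tensors are [num_tokens]
    ratio = (trainer_logprobs - sampler_logprobs).detach().exp()  # importance weight, no grad
    weight = ratio.clamp(max=clip_c)                              # truncate large weights
    # Standard policy-gradient surrogate, corrected for the sampler/trainer mismatch.
    return -(weight * advantages * trainer_logprobs).mean()
```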
Mika Senghaas (@mikasenghaas) 's Twitter Profile Photo

moving from vllm v0 to v1 made our async rl training crash! read how we fixed it

we recently migrated from v0 to v1 as part of a larger refactor of prime-rl to make it easier-to-use, more performant and naturally async. we confirmed correct training dynamics on many
Si-ze Zheng (@deeplyignorant) 's Twitter Profile Photo

🎉 Excited to share: We've open-sourced Triton-distributed MegaKernel! A fresh, powerful take on MegaKernel for LLMs, built entirely on our Triton-distributed framework.
github.com/ByteDance-Seed…

Why it's awesome:
🧩 Super programmable
⚡ Blazing performance
📊 Rock-solid precision
surya (@suryasure05) 's Twitter Profile Photo

I spent my summer building TinyTPU: an open-source ML inference and training chip. It can do end-to-end inference + training ENTIRELY on chip. Here's how I did it 👇

Stuart Sul (@stuart_sul) 's Twitter Profile Photo

MoE layers can be really slow. When training our coding models at Cursor, they ate up 27-53% of training time.

So we completely rebuilt them at the kernel level and transitioned to MXFP8. The result: a 3.5x faster MoE layer and a 1.5x end-to-end training speedup.

We believe our
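For context on the MXFP8 format mentioned above, a toy simulation of microscaling FP8 quantization (my own sketch of the OCP MX idea, not Cursor's kernels): elements are grouped into blocks of 32 that share a power-of-two scale, with each element stored in FP8 (e4m3).

```python
import torch

def quantize_mxfp8_sim(x: torch.Tensor, block: int = 32):
    # Toy MXFP8 simulation: per-32-element power-of-two scale + FP8 (e4m3) elements.
    # Assumes x.numel() is a multiple of `block`; real kernels handle padding/layout.
    xb = x.reshape(-1, block)
    amax = xb.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    # Power-of-two scale chosen so the block max fits within e4m3's max value (448).
    scale = torch.exp2(torch.ceil(torch.log2(amax / 448.0)))
    q = (xb / scale).to(torch.float8_e4m3fn)
    return q, scale

def dequantize_mxfp8_sim(q, scale, shape):
    return (q.to(torch.float32) * scale).reshape(shape)
```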
SemiAnalysis (@semianalysis_) 's Twitter Profile Photo

H100 vs GB200 NVL72 Training Benchmarks: Power, TCO, and Reliability Analysis; Software Improvement Over Time; Joules per Token; TCO per Million Tokens; MFU; Tokens per US Annual Household Energy Usage; DeepSeek 670B GB200 Unreliability; Backplane Downtime
semianalysis.com/2025/08/20/h10…

elie (@eliebakouch) 's Twitter Profile Photo

Motif 2.6B tech report is pretty insane, first time I see a model with differential attention and polynorm trained at scale!

> It's trained on 2.5T tokens, with a "data mixture schedule" to continuously adjust the mixture over training.
> They use WSD with a "Simple moving
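For reference, a minimal sketch of differential attention as described in the DIFF Transformer paper (shapes and the fixed lambda here are illustrative; the paper uses a learnable, reparameterized lambda): two softmax attention maps are computed from split query/key projections and subtracted so common-mode attention noise cancels.

```python
import torch

def differential_attention(q1, q2, k1, k2, v, lam: float = 0.5):
    # q1, q2, k1, k2: [B, H, T, D], the two halves of the query/key projections.
    # v: [B, H, T, Dv]; lam is a plain scalar here (learnable in the paper).
    d = q1.shape[-1]
    a1 = torch.softmax(q1 @ k1.transpose(-2, -1) / d ** 0.5, dim=-1)
    a2 = torch.softmax(q2 @ k2.transpose(-2, -1) / d ** 0.5, dim=-1)
    return (a1 - lam * a2) @ v  # subtracting the second map cancels shared attention noise
```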