Kimbo Chen (@kimbochen)'s Twitter Profile

Kimbo Chen
@kimbochen

High-performance ML algorithms, compilers, and systems

ID: 2870711864

Link: https://github.com/kimbochen/md-blogs
Joined: 22-10-2014 10:53:31

1.1K Tweets
381 Followers
583 Following

stochasm (@stochasticchasm) 's Twitter Profile Photo

This paper answers a long-standing question I had and claims that decay + merging does not outperform merging alone, which simplifies things quite nicely

Adam Zweiger (@adamzweiger) 's Twitter Profile Photo

Here are all the architecture tricks used by gpt-oss:
- Attention sinks: for each attention head, have a learned scalar such that softmax(qk) becomes a softmax over [a_1, a_2, ..., a_T, sink]. Tokens don't have to attend to anything if all the attention scores are low!
-
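A minimal sketch of the sink mechanism described above (tensor names and shapes are illustrative, not gpt-oss's actual code): append one learned logit per head before the softmax, then drop the sink column so a row's attention weights can sum to less than 1.

```python
import torch

def attention_with_sink(q, k, v, sink_logit):
    # q, k, v: [B, H, T, D]; sink_logit: [H], one learned scalar per head.
    # Causal masking is omitted for brevity.
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5               # [B, H, T, T]
    sink = sink_logit.view(1, -1, 1, 1).expand(*scores.shape[:-1], 1)   # [B, H, T, 1]
    probs = torch.softmax(torch.cat([scores, sink], dim=-1), dim=-1)
    probs = probs[..., :-1]  # drop the sink column: rows now sum to <= 1,
                             # so a token can effectively attend to "nothing"
    return probs @ v
```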

Xiangming Gu @ ICLR 2025 (@gu_xiangming) 's Twitter Profile Photo

I noticed that OpenAI added a learnable bias to the attention logits before the softmax, then dropped it after the softmax. This is similar to what I did in my ICLR 2025 paper: openreview.net/forum?id=78Nn4… I used a learnable key bias and set the corresponding value bias to zero. In this way,
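A minimal sketch of the key-bias variant described here (names are mine, not the paper's): append one extra learnable key per head paired with an all-zero value, which gives the softmax a sink slot that absorbs probability mass but contributes nothing to the output.

```python
import torch

def attention_with_key_bias(q, k, v, k_bias):
    # q, k, v: [B, H, T, D]; k_bias: [H, D], learnable, paired with a zero value.
    B, H, T, D = k.shape
    k_ext = torch.cat([k, k_bias.view(1, H, 1, D).expand(B, H, 1, D)], dim=2)              # [B, H, T+1, D]
    v_ext = torch.cat([v, torch.zeros(B, H, 1, D, dtype=v.dtype, device=v.device)], dim=2)
    probs = torch.softmax(q @ k_ext.transpose(-2, -1) / D ** 0.5, dim=-1)                  # [B, H, T, T+1]
    return probs @ v_ext  # the extra slot soaks up attention without adding anything
```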
Feng Yao (@fengyao1909) 's Twitter Profile Photo

Failing on large-scale RL with VeRL?

⚠️ Mixing an inference backend (vLLM/SGLang) with a training backend (FSDP/Megatron) secretly turns your RL into off-policy RL, even if they share the same weights!

📉 Blog:
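One way to see the mismatch in practice (a diagnostic sketch, not VeRL's API; the two log-prob tensors are placeholders for values computed by the training and inference engines on the same sampled tokens): even with identical weights, kernel and numerics differences make the two per-token distributions disagree, so the sampler acts as a slightly different behavior policy.

```python
import torch

def offpolicy_gap(trainer_logprobs: torch.Tensor, sampler_logprobs: torch.Tensor) -> dict:
    # Both tensors are [num_tokens]: log p(token) for the *same* sampled tokens,
    # one computed by the training backend, one by the inference backend.
    diff = trainer_logprobs - sampler_logprobs
    ratio = diff.exp()  # per-token importance ratio pi_train / pi_infer
    return {
        "mean_abs_logprob_diff": diff.abs().mean().item(),
        "max_ratio": ratio.max().item(),  # far from 1.0 means effectively off-policy tokens
        "min_ratio": ratio.min().item(),
    }
```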
Jonathan Chang (@cccntu) 's Twitter Profile Photo

While we wait for GPT-5 to drop, here is a flex attention tutorial for building a <1000 LoC vLLM from scratch: jonathanc.net/blog/vllm-flex…
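For context, a minimal FlexAttention usage sketch (my own toy example, not taken from the linked tutorial), using PyTorch's torch.nn.attention.flex_attention with a causal block mask:

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx  # mask_mod: which (query, key) positions may attend

B, H, S, D = 1, 8, 1024, 64
q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) for _ in range(3))

# The block mask lets FlexAttention skip fully-masked tiles entirely (requires a GPU).
block_mask = create_block_mask(causal, B=None, H=None, Q_LEN=S, KV_LEN=S)
out = flex_attention(q, k, v, block_mask=block_mask)
```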

Guangxuan Xiao (@guangxuan_xiao) 's Twitter Profile Photo

I've written the full story of Attention Sinks: a technical deep-dive into how the mechanism was developed and how our research ended up being used in OpenAI's new OSS models.

For those interested in the details:
hanlab.mit.edu/blog/streaming…
Jinjie Ni @ ICLR'25 🇸🇬 (@nijinjie) 's Twitter Profile Photo

Token crisis: solved. ✅

We pre-trained diffusion language models (DLMs) vs. autoregressive (AR) models from scratch, up to 8B params, 480B tokens, 480 epochs.

Findings:
> DLMs beat AR when tokens are limited, with >3× data potential.
> A 1B DLM trained on just 1B tokens
wh (@nrehiew_) 's Twitter Profile Photo

Let's talk about the GLM 4.5 models.

The latest frontier open weights model out of China (and possibly the best at the moment?) with quite a bit of details in the paper.
Dimitris Papailiopoulos (@dimitrispapail) 's Twitter Profile Photo

Another interesting observation: performing SGD on the cross-entropy loss over your text corpus is equivalent to REINFORCE, i.e., on-policy policy gradient, with the binary reward "did my model generate text from the corpus?"
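One way to see the claimed correspondence (my own sketch of the argument, with a per-sequence reward r(x) = 1 iff x is corpus text):

```latex
% REINFORCE update for a sampled sequence x with binary reward r(x) = \mathbf{1}[x \in \text{corpus}]:
r(x)\,\nabla_\theta \log \pi_\theta(x) =
\begin{cases}
  \nabla_\theta \log \pi_\theta(x) & x \in \text{corpus} \\
  0 & \text{otherwise,}
\end{cases}
% which on corpus text is exactly the (negative) cross-entropy / MLE gradient
% and a no-op everywhere else:
-\nabla_\theta \mathcal{L}_{\mathrm{CE}}(x) = \nabla_\theta \log \pi_\theta(x), \qquad x \in \text{corpus}.
```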

Feng Yao (@fengyao1909) 's Twitter Profile Photo

Liyuan Liu (Lucas) Chengyu Dong Dinghuai Zhang 张鼎怀 Jingbo Shang Jianfeng Gao (2/4) What's the secret sauce?

We build on our previous truncated importance sampling (TIS) blog (fengyao.notion.site/off-policy-rl) to address this issue. Here's a quick summary of how it works.
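A minimal sketch of truncated importance sampling as I understand it from the linked blog (function and variable names are mine, not the authors'): reweight the policy-gradient loss by the per-token ratio between the training policy and the inference-engine policy that actually generated the tokens, truncated from above to keep variance bounded.

```python
import torch

def tis_policy_loss(trainer_logprobs, sampler_logprobs, advantages, clip_c: float = 2.0):
    # trainer_logprobs: log pi_train(token) from the training backend (FSDP/Megatron)
    # sampler_logprobs: log pi_infer(token) from the inference backend (vLLM/SGLang)
    # advantages:       per-token advantage estimates; all tensors are [num_tokens]
    ratio = (trainer_logprobs - sampler_logprobs).detach().exp()  # importance weight, no grad
    weight = ratio.clamp(max=clip_c)                              # truncate large weights
    # Standard policy-gradient surrogate, corrected for the sampler/trainer mismatch.
    return -(weight * advantages * trainer_logprobs).mean()
```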
Mika Senghaas (@mikasenghaas) 's Twitter Profile Photo

moving from vllm v0 to v1 made our async rl training crash! read how we fixed it

we recently migrated from v0 to v1 as part of a larger refactor of prime-rl to make it easier-to-use, more performant and naturally async. we confirmed correct training dynamics on many
Si-ze Zheng (@deeplyignorant) 's Twitter Profile Photo

🎉 Excited to share: We've open-sourced Triton-distributed MegaKernel! A fresh, powerful take on MegaKernel for LLMs, built entirely on our Triton-distributed framework.
github.com/ByteDance-Seed…

Why it's awesome:
🧩 Super programmable
⚡ Blazing performance
📊 Rock-solid precision
surya (@suryasure05) 's Twitter Profile Photo

I spent my summer building TinyTPU: an open-source ML inference and training chip. It can do end-to-end inference + training ENTIRELY on chip. Here's how I did it 👇

Stuart Sul (@stuart_sul) 's Twitter Profile Photo

MoE layers can be really slow. When training our coding models at Cursor, they ate up 27-53% of training time.

So we completely rebuilt them at the kernel level and transitioned to MXFP8. The result: a 3.5x faster MoE layer and a 1.5x end-to-end training speedup.

We believe our
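For context on the MXFP8 format mentioned above, a toy simulation of microscaling FP8 quantization (my own sketch of the OCP MX idea, not Cursor's kernels): elements are grouped into blocks of 32 that share a power-of-two scale, with each element stored in FP8 (e4m3).

```python
import torch

def quantize_mxfp8_sim(x: torch.Tensor, block: int = 32):
    # Toy MXFP8 simulation: per-32-element power-of-two scale + FP8 (e4m3) elements.
    # Assumes x.numel() is a multiple of `block`; real kernels handle padding/layout.
    xb = x.reshape(-1, block)
    amax = xb.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    # Power-of-two scale chosen so the block max fits within e4m3's max value (448).
    scale = torch.exp2(torch.ceil(torch.log2(amax / 448.0)))
    q = (xb / scale).to(torch.float8_e4m3fn)
    return q, scale

def dequantize_mxfp8_sim(q, scale, shape):
    return (q.to(torch.float32) * scale).reshape(shape)
```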
SemiAnalysis (@semianalysis_) 's Twitter Profile Photo

H100 vs GB200 NVL72 Training Benchmarks: Power, TCO, and Reliability Analysis; Software Improvement Over Time; Joules per Token; TCO per Million Tokens; MFU; Tokens per US Annual Household Energy Usage; DeepSeek 670B GB200 Unreliability; Backplane Downtime
semianalysis.com/2025/08/20/h10…

elie (@eliebakouch) 's Twitter Profile Photo

Motif 2.6B tech report is pretty insane, first time I see a model with differential attention and polynorm trained at scale!

> It's trained on 2.5T tokens, with a "data mixture schedule" to continuously adjust the mixture over training.
> They use WSD with a "Simple moving
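For reference, a minimal sketch of differential attention as described in the DIFF Transformer paper (shapes and the fixed lambda here are illustrative; the paper uses a learnable, reparameterized lambda): two softmax attention maps are computed from split query/key projections and subtracted so common-mode attention noise cancels.

```python
import torch

def differential_attention(q1, q2, k1, k2, v, lam: float = 0.5):
    # q1, q2, k1, k2: [B, H, T, D], the two halves of the query/key projections.
    # v: [B, H, T, Dv]; lam is a plain scalar here (learnable in the paper).
    d = q1.shape[-1]
    a1 = torch.softmax(q1 @ k1.transpose(-2, -1) / d ** 0.5, dim=-1)
    a2 = torch.softmax(q2 @ k2.transpose(-2, -1) / d ** 0.5, dim=-1)
    return (a1 - lam * a2) @ v  # subtracting the second map cancels shared attention noise
```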