Kimbo Chen
@kimbochen
High-performance ML algorithms, compilers, and systems
ID: 2870711864
https://github.com/kimbochen/md-blogs 22-10-2014 10:53:31
1.1K Tweets
381 Followers
583 Following
I noticed that OpenAI added a learnable bias to the attention logits before softmax. After softmax, they deleted the bias. This is similar to what I did in my ICLR 2025 paper: openreview.net/forum?id=78Nn4… I used a learnable key bias and set the corresponding value bias to zero. In this way,
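For intuition, here is a minimal PyTorch sketch of the mechanism described above: append a learnable per-head sink logit as an extra column of the attention logits, take the softmax, then drop that column, which is equivalent to attending to a learnable key bias whose value is zero. The function name, shapes, and the per-head `sink_logit` parameter are my assumptions for illustration, not OpenAI's or the paper's actual code.

```python
import torch
import torch.nn.functional as F

def sink_attention(q, k, v, sink_logit):
    """q, k, v: [batch, heads, seq, dim]; sink_logit: learnable [heads] tensor."""
    d = q.shape[-1]
    logits = q @ k.transpose(-2, -1) / d**0.5               # [b, h, s, s]
    sink = sink_logit.view(1, -1, 1, 1).expand(*logits.shape[:-1], 1)
    logits = torch.cat([logits, sink], dim=-1)              # learnable bias column before softmax
    probs = F.softmax(logits, dim=-1)[..., :-1]             # delete the bias column after softmax
    return probs @ v                                        # same as giving the sink a zero value
```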
Failing on large-scale RL with VeRL? ⚠️ Mixing inference backends (vLLM/SGLang) with training backends (FSDP/Megatron) secretly turns your RL into off-policy, even if they share the same weights! Blog:
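As I read the claim, the mismatch comes from the two backends computing slightly different token probabilities for the same weights (different kernels, fusion, precision), so the distribution you sample from is not the one you optimize. A hypothetical diagnostic, assuming you can extract per-token log-probs of the sampled tokens from both backends:

```python
import torch

def policy_gap(rollout_logp: torch.Tensor, train_logp: torch.Tensor) -> dict:
    """rollout_logp: log-probs of the sampled tokens from the inference engine
    (e.g. vLLM); train_logp: log-probs of the same tokens from the training
    engine (e.g. FSDP). Same weights, yet the ratio drifts from 1.0, which
    means the 'on-policy' update is actually off-policy."""
    diff = train_logp - rollout_logp
    return {
        "mean_importance_ratio": diff.exp().mean().item(),  # exactly 1.0 only if truly on-policy
        "max_abs_logprob_gap": diff.abs().max().item(),
    }
```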
Liyuan Liu (Lucas), Chengyu Dong, Dinghuai Zhang (张鼎怀), Jingbo Shang, Jianfeng Gao (2/4) What's the secret sauce? We build on our previous truncated importance sampling (TIS) blog (fengyao.notion.site/off-policy-rl) to address this issue. Here's a quick summary of how it works.
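My understanding of the TIS correction from the linked blog: reweight each token's policy-gradient term by the training/rollout probability ratio, truncated at a cap to bound variance. A sketch under that reading, with `cap` as an assumed hyperparameter name:

```python
import torch

def tis_loss(train_logp, rollout_logp, advantages, cap=2.0):
    """Truncated importance sampling: correct the rollout/training mismatch by
    weighting each token with min(pi_train / pi_rollout, cap). The weight is
    detached so gradients flow only through train_logp."""
    ratio = (train_logp - rollout_logp).exp().clamp(max=cap).detach()
    return -(ratio * advantages * train_logp).mean()
```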
H100 vs GB200 NVL72 Training Benchmarks: Power, TCO, and Reliability Analysis; Software Improvement Over Time; Joules per Token; TCO per Million Tokens; MFU; Tokens per US Annual Household Energy Usage; DeepSeek 670B; GB200 Unreliability; Backplane Downtime. semianalysis.com/2025/08/20/h10…
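For reference, the headline efficiency metrics named above reduce to simple ratios. A back-of-envelope sketch with illustrative formulas (my definitions, not the article's data):

```python
def joules_per_token(avg_power_watts: float, tokens_per_second: float) -> float:
    # Watts are joules per second, so dividing by throughput gives J/token.
    return avg_power_watts / tokens_per_second

def tco_per_million_tokens(cost_per_hour_usd: float, tokens_per_second: float) -> float:
    # Total cost of ownership per hour, spread over the tokens produced in that hour.
    return cost_per_hour_usd / (tokens_per_second * 3600.0) * 1e6
```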