 
                                Yushun Zhang
@ericzhang0410
PhD student at The Chinese University of Hong Kong, Shenzhen, China.
Working on optimization and LLMs. zyushun.github.io
ID: 1239780017040580610
17-03-2020 05:06:12
326 Tweets
279 Followers
357 Following
 
        Check out this excellent work led by Dmitry Rybin! We discovered a new algorithm that computes the matrix product XX^T with 5% fewer multiplications.
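For intuition, here is a minimal sketch of the baseline symmetry trick that already makes XX^T cheaper than a generic matrix product: since the result is symmetric, only the upper triangle needs to be formed explicitly. This is not the announced algorithm (which saves a further ~5% of multiplications through a cleverer recursion); the helper name `gram_upper` is mine.

```python
import numpy as np

def gram_upper(X):
    # Compute G = X @ X.T using its symmetry: form only the upper
    # triangle explicitly, then mirror it. This is the classic
    # baseline trick, NOT the new algorithm from the tweet.
    n = X.shape[0]
    G = np.empty((n, n))
    for i in range(n):
        G[i, i:] = X[i] @ X[i:].T   # row i against rows i..n-1
        G[i:, i] = G[i, i:]         # mirror into the lower triangle
    return G

X = np.random.randn(4, 6)
assert np.allclose(gram_upper(X), X @ X.T)
```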
 
        Holy shit. Kimi K2 was pre-trained on 15.5T tokens using MuonClip with zero training spikes. Muon has officially scaled to the 1-trillion-parameter LLM level. Many doubted it could scale, but here we are. So proud of the Muon team: Keller Jordan, Vlado Boza, You Jiacheng,
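For readers unfamiliar with Muon, here is a minimal sketch of the Muon-style update: momentum, then approximate orthogonalization via a quintic Newton-Schulz iteration, then a scaled step. The coefficients and the shape-dependent scale follow Keller Jordan's public reference code and should be treated as assumptions; MuonClip's extra QK-clip mechanism for attention logits is not shown.

```python
import torch

def newton_schulz5(G, steps=5):
    # Quintic Newton-Schulz iteration that approximately maps G to the
    # nearest (semi-)orthogonal matrix. Coefficients as in Keller
    # Jordan's public Muon code (an assumption in this sketch).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)       # bring the spectral norm below 1
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(weight, grad, buf, lr=0.02, beta=0.95):
    # One Muon-style update on a 2D weight: momentum, orthogonalize,
    # then step. A sketch, not the exact MuonClip used for Kimi K2.
    buf.mul_(beta).add_(grad)
    update = newton_schulz5(buf)
    # Shape-dependent scale: one published variant, an assumption here.
    scale = max(1.0, weight.shape[0] / weight.shape[1]) ** 0.5
    weight.add_(update, alpha=-lr * scale)

W = torch.randn(256, 128)
buf = torch.zeros_like(W)
muon_step(W, torch.randn_like(W), buf)
```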
 
        Awesome! Kaiyue Wen, this is related to our earlier discussion.
 
        Quoted tweet from Laker Newhouse (@lakernewhouse):
        [1/9] We created a performant Lipschitz transformer by spectrally regulating the weights—without using activation stability tricks: no layer norm, QK norm, or logit softcapping. We think this may address a “root cause” of unstable training.
        ![photo](https://pbs.twimg.com/media/GwPX6BxXkAAc-JP.jpg)
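The quoted thread's core idea, controlling a layer's Lipschitz constant by regulating the spectral norm of its weights, can be illustrated with standard spectral normalization via power iteration. This sketch shows the general technique only; the thread's exact regulation scheme may differ, and `spectral_normalize` is a hypothetical helper name.

```python
import torch

def spectral_normalize(W, n_iters=20):
    # Rescale W so its spectral norm (largest singular value) is at
    # most 1, with the top singular value estimated by power
    # iteration. This bounds the Lipschitz constant of x -> W @ x;
    # the quoted thread's exact scheme may differ.
    u = torch.randn(W.shape[0])
    for _ in range(n_iters):
        v = W.T @ u
        v = v / (v.norm() + 1e-12)
        u = W @ v
        u = u / (u.norm() + 1e-12)
    sigma = u @ W @ v                 # estimated top singular value
    return W / sigma.clamp(min=1.0)   # shrink only if sigma > 1

W = torch.randn(64, 128)
W_sn = spectral_normalize(W)
assert torch.linalg.matrix_norm(W_sn, ord=2) <= 1.0 + 1e-4
```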