 
                                Yushun Zhang
@ericzhang0410
PhD student at The Chinese University of Hong Kong, Shenzhen, China.
Working on optimization and LLMs. zyushun.github.io
ID: 1239780017040580610
17-03-2020 05:06:12
326 Tweets
279 Followers
357 Following
 
        Check out this excellent work led by Dmitry Rybin! We discovered a new algorithm that computes the matrix product XX^T with 5% fewer multiplications.
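For intuition, here is a minimal sketch of the baseline symmetry trick that already makes XX^T cheaper than a generic matrix product: since the result is symmetric, only the upper triangle needs to be formed explicitly. This is not the announced algorithm (which saves a further ~5% of multiplications through a cleverer recursion); the helper name `gram_upper` is mine.

```python
import numpy as np

def gram_upper(X):
    # Compute G = X @ X.T using its symmetry: form only the upper
    # triangle explicitly, then mirror it. This is the classic
    # baseline trick, NOT the new algorithm from the tweet.
    n = X.shape[0]
    G = np.empty((n, n))
    for i in range(n):
        G[i, i:] = X[i] @ X[i:].T   # row i against rows i..n-1
        G[i:, i] = G[i, i:]         # mirror into the lower triangle
    return G

X = np.random.randn(4, 6)
assert np.allclose(gram_upper(X), X @ X.T)
```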
 
        Holy shit. Kimi K2 was pre-trained on 15.5T tokens using MuonClip with zero training spikes. Muon has officially scaled to the 1-trillion-parameter LLM level. Many doubted it could scale, but here we are. So proud of the Muon team: Keller Jordan, Vlado Boza, You Jiacheng,
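For readers unfamiliar with Muon, here is a minimal sketch of the Muon-style update: momentum, then approximate orthogonalization via a quintic Newton-Schulz iteration, then a scaled step. The coefficients and the shape-dependent scale follow Keller Jordan's public reference code and should be treated as assumptions; MuonClip's extra QK-clip mechanism for attention logits is not shown.

```python
import torch

def newton_schulz5(G, steps=5):
    # Quintic Newton-Schulz iteration that approximately maps G to the
    # nearest (semi-)orthogonal matrix. Coefficients as in Keller
    # Jordan's public Muon code (an assumption in this sketch).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)       # bring the spectral norm below 1
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(weight, grad, buf, lr=0.02, beta=0.95):
    # One Muon-style update on a 2D weight: momentum, orthogonalize,
    # then step. A sketch, not the exact MuonClip used for Kimi K2.
    buf.mul_(beta).add_(grad)
    update = newton_schulz5(buf)
    # Shape-dependent scale: one published variant, an assumption here.
    scale = max(1.0, weight.shape[0] / weight.shape[1]) ** 0.5
    weight.add_(update, alpha=-lr * scale)

W = torch.randn(256, 128)
buf = torch.zeros_like(W)
muon_step(W, torch.randn_like(W), buf)
```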
 
        Awesome! Kaiyue Wen, this is related to our earlier discussion.
 
        Quoted tweet from Laker Newhouse (@lakernewhouse):
        [1/9] We created a performant Lipschitz transformer by spectrally regulating the weights—without using activation stability tricks: no layer norm, QK norm, or logit softcapping. We think this may address a “root cause” of unstable training.
        ![photo](https://pbs.twimg.com/media/GwPX6BxXkAAc-JP.jpg)
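The quoted thread's core idea, controlling a layer's Lipschitz constant by regulating the spectral norm of its weights, can be illustrated with standard spectral normalization via power iteration. This sketch shows the general technique only; the thread's exact regulation scheme may differ, and `spectral_normalize` is a hypothetical helper name.

```python
import torch

def spectral_normalize(W, n_iters=20):
    # Rescale W so its spectral norm (largest singular value) is at
    # most 1, with the top singular value estimated by power
    # iteration. This bounds the Lipschitz constant of x -> W @ x;
    # the quoted thread's exact scheme may differ.
    u = torch.randn(W.shape[0])
    for _ in range(n_iters):
        v = W.T @ u
        v = v / (v.norm() + 1e-12)
        u = W @ v
        u = u / (u.norm() + 1e-12)
    sigma = u @ W @ v                 # estimated top singular value
    return W / sigma.clamp(min=1.0)   # shrink only if sigma > 1

W = torch.randn(64, 128)
W_sn = spectral_normalize(W)
assert torch.linalg.matrix_norm(W_sn, ord=2) <= 1.0 + 1e-4
```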