Quanquan Gu (@quanquangu) 's Twitter Profile
Quanquan Gu

@quanquangu

Professor @UCLA, Research Scientist at ByteDance | Recent work: SPIN, SPPO, DPLM, GPM, CryoFM, MARS, TPA | Opinions are my own

ID: 901303999529312256

http://www.cs.ucla.edu/~qgu/ | Joined 26-08-2017 04:43:13

1.1K Tweets

13.13K Followers

1.1K Following

Aakash Kumar Nain (@a_k_nain) 's Twitter Profile Photo

Does a better pretraining loss result in better performance on downstream tasks? Do downstream scaling laws exist? What kind of relationship exists between pretraining loss and performance on downstream tasks? This latest paper from NYU studies the reliability of downstream

Shikai Qiu (@shikaiqiu) 's Twitter Profile Photo

📉Learning rate decay is super effective and sometimes mysterious, but the simplest model of SGD on quadratic loss w/ noisy gradients almost perfectly predicts loss curves of transformers trained with Adam on real data, across schedules, model sizes, and token budgets.

1/4
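
For intuition, here is a minimal sketch (not the authors' code) of the noisy-quadratic picture the thread describes: SGD on a quadratic loss with additive gradient noise, tracked in expectation under a few learning-rate schedules. The eigenvalue spectrum, noise level, peak learning rate, and schedules below are illustrative assumptions.

```python
# Minimal sketch of SGD on a quadratic loss with noisy gradients, in expectation.
# Spectrum, noise level, and schedules are illustrative assumptions, not fitted values.
import numpy as np

def expected_risk_curve(eigs, sigma2, lr_schedule, steps, r0=1.0):
    """Track E[L_t] = 0.5 * sum_i lam_i * E[theta_i^2] for the update
    theta_{t+1} = theta_t - eta_t * (lam * theta_t + noise), Var[noise] = sigma2."""
    second_moments = np.full_like(eigs, r0)   # E[theta_i^2] per eigendirection
    losses = []
    for t in range(steps):
        eta = lr_schedule(t, steps)
        losses.append(0.5 * np.sum(eigs * second_moments))
        # Exact recursion for each coordinate's second moment under the noisy update.
        second_moments = (1.0 - eta * eigs) ** 2 * second_moments + eta ** 2 * sigma2
    return np.array(losses)

# Illustrative spectrum and schedules (assumptions).
eigs = np.logspace(-3, 0, 50)
constant = lambda t, T: 0.5
cosine   = lambda t, T: 0.5 * 0.5 * (1 + np.cos(np.pi * t / T))
linear   = lambda t, T: 0.5 * (1 - t / T)

for name, sched in [("constant", constant), ("cosine", cosine), ("linear", linear)]:
    curve = expected_risk_curve(eigs, sigma2=1e-3, lr_schedule=sched, steps=2000)
    print(f"{name:8s} final expected loss: {curve[-1]:.5f}")
```

Decaying schedules shrink the noise-injection term η²σ² late in training, which is the mechanism this simple model captures.
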
Quanquan Gu (@quanquangu) 's Twitter Profile Photo

Can’t make it to #ICML2025 this year. People ask why I’m so obsessed with pretraining and scaling. Simple: the AGI era is here. I refuse to be irrelevant.

Quanquan Gu (@quanquangu) 's Twitter Profile Photo

Moving from theory to large models has not made me popular along the way. Some people can't adjust to your change; some don't want you to actually succeed. Losers and haters make noise. Builders build. Feel the AGI!

Volkan Cevher (@cevherlions) 's Twitter Profile Photo

Excited to give a tutorial with Leena C Vankadara on Training Neural Networks at Any Scale (TRAINS) at the ICML Conference at 13:30 (West Ballroom A). Our slides can be found here: go.epfl.ch/ICML25TRAINS. Please join us.

PapersAnon (@papers_anon) 's Twitter Profile Photo

Mixture of Raytraced Experts

Stacked MoE architecture that can dynamically select sequences of experts, producing computational graphs of variable width and depth. Allows predictions with increasing accuracy as the computation cycles through the experts' sequence.

Links below
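
For a rough feel of the general idea (this is a toy sketch, not the published architecture; the paper's actual routing mechanism may differ), here is a hypothetical module that routes an input through a sequence of experts chosen step by step, emitting an intermediate prediction after each step so accuracy can improve as more experts fire. The expert count, router, residual update, and readout below are all assumptions.

```python
# Toy sketch of sequential expert routing with anytime predictions.
# NOT the paper's architecture; all design choices here are assumptions.
import torch
import torch.nn as nn

class SequentialExpertRouter(nn.Module):
    def __init__(self, dim, num_experts=4, max_steps=3):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
            for _ in range(num_experts)
        )
        self.router = nn.Linear(dim, num_experts)  # picks the next expert from the current state
        self.head = nn.Linear(dim, 1)              # readout usable after every routing step
        self.max_steps = max_steps

    def forward(self, x):
        preds = []
        h = x
        for _ in range(self.max_steps):
            weights = torch.softmax(self.router(h), dim=-1)       # soft choice of the next expert
            expert_out = torch.stack([e(h) for e in self.experts], dim=-1)
            h = h + (expert_out * weights.unsqueeze(-2)).sum(-1)   # residual update from chosen experts
            preds.append(self.head(h))                             # "anytime" prediction at this depth
        return preds  # later entries have passed through a longer expert sequence

x = torch.randn(8, 16)
model = SequentialExpertRouter(dim=16)
print([p.shape for p in model(x)])  # one prediction per routing step
```
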
Quanquan Gu (@quanquangu) 's Twitter Profile Photo

μP plays a central role in scaling large language models, known for hyperparameter transfer & stability. But don’t overlook its feature learning power. 📈
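
For reference, a rough sketch of the commonly cited μP heuristics for hidden (matrix-like) weights under Adam: tune at a base width, then rescale the init and learning rate with width so the optimum transfers. The base width, base learning rate, and helper below are illustrative assumptions; the full rules for embedding/readout layers are in the μP papers and the `mup` package.

```python
# Rough sketch of muP-style width scaling for hidden weights under Adam.
# Numbers are illustrative assumptions, not tuned values.
import math

def mup_hidden_hparams(width, base_width=256, base_lr=1e-3):
    """Commonly cited muP rules for hidden weights with Adam:
    init std ~ 1/sqrt(fan_in), learning rate ~ 1/width relative to a tuned base."""
    init_std = 1.0 / math.sqrt(width)
    lr = base_lr * base_width / width
    return init_std, lr

for width in [256, 1024, 4096]:
    std, lr = mup_hidden_hparams(width)
    print(f"width={width:5d}  init_std={std:.4f}  adam_lr={lr:.2e}")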

Qingyue Zhao (@zhaoqingyue) 's Twitter Profile Photo

Drop by our poster in Ballroom A, West Building, to check out our cute analysis techniques and the rich set of future directions opened up by our work.

Francis Bach (@bachfrancis) 's Twitter Profile Photo

Tired of lengthy computations to derive scaling laws? This post is made for you: discover the sharpness of the z-transform! francisbach.com/z-transform/
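
For context, the central object in this kind of approach is the one-sided z-transform (generating function) of a sequence, whose singular behavior as z → 1⁻ encodes the sequence's power-law tail. A hedged summary of that standard correspondence (notation is generic, not necessarily the post's):

```latex
% One-sided z-transform (generating function) of a sequence (a_n):
\[
  A(z) \;=\; \sum_{n \ge 0} a_n z^n, \qquad |z| < 1 .
\]
% Abelian-type correspondence turning a power-law tail into a singularity at z = 1:
\[
  a_n \;\sim\; \frac{c}{n^{\alpha}} \ (0 < \alpha < 1)
  \quad \Longrightarrow \quad
  A(z) \;\sim\; c\,\Gamma(1-\alpha)\,(1-z)^{\alpha - 1}
  \quad \text{as } z \to 1^- .
\]
```
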

Mikhail Parakhin (@mparakhin) 's Twitter Profile Photo

Since nobody asked :-), here is my list of papers not to be missed from ICML:
1) Dion: distributed orthonormalized updates (well, technically not at ICML, but everyone's talking about it).
2) MARS: Unleashing the Power of Variance Reduction for Training Large Models
3) ...

Sheryl Hsu (@sherylhsu02) 's Twitter Profile Photo

The model solves these problems without tools like Lean or coding; it just uses natural language, and it only has 4.5 hours. We see the model reason at a very high level: trying out different strategies, making observations from examples, and testing hypotheses.

Aryo Pradipta Gema (@aryopg) 's Twitter Profile Photo

New Anthropic Research: “Inverse Scaling in Test-Time Compute”

We found cases where longer reasoning leads to lower accuracy.
Our findings suggest that naïve scaling of test-time compute may inadvertently reinforce problematic reasoning patterns.

🧵