Quanquan Gu (@quanquangu) 's Twitter Profile
Quanquan Gu

@quanquangu

Professor @UCLA, Research Scientist at ByteDance | Recent work: SPIN, SPPO, DPLM, GPM, CryoFM, MARS, TPA | Opinions are my own

ID: 901303999529312256

http://www.cs.ucla.edu/~qgu/ | Joined 26-08-2017 04:43:13

1.1K Tweets

13.13K Followers

1.1K Following

Aakash Kumar Nain (@a_k_nain) 's Twitter Profile Photo

Does a better pretraining loss result in better performance on downstream tasks? Do downstream scaling laws exist? What kind of relationship exists between pretraining loss and performance on downstream tasks? This latest paper from NYU studies the reliability of downstream

Shikai Qiu (@shikaiqiu) 's Twitter Profile Photo

📉Learning rate decay is super effective and sometimes mysterious, but the simplest model of SGD on quadratic loss w/ noisy gradients almost perfectly predicts loss curves of transformers trained with Adam on real data, across schedules, model sizes, and token budgets.

1/4
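
For intuition, here is a minimal sketch (not the authors' code) of the noisy-quadratic picture the thread describes: SGD on a quadratic loss with additive gradient noise, tracked in expectation under a few learning-rate schedules. The eigenvalue spectrum, noise level, peak learning rate, and schedules below are illustrative assumptions.

```python
# Minimal sketch of SGD on a quadratic loss with noisy gradients, in expectation.
# Spectrum, noise level, and schedules are illustrative assumptions, not fitted values.
import numpy as np

def expected_risk_curve(eigs, sigma2, lr_schedule, steps, r0=1.0):
    """Track E[L_t] = 0.5 * sum_i lam_i * E[theta_i^2] for the update
    theta_{t+1} = theta_t - eta_t * (lam * theta_t + noise), Var[noise] = sigma2."""
    second_moments = np.full_like(eigs, r0)   # E[theta_i^2] per eigendirection
    losses = []
    for t in range(steps):
        eta = lr_schedule(t, steps)
        losses.append(0.5 * np.sum(eigs * second_moments))
        # Exact recursion for each coordinate's second moment under the noisy update.
        second_moments = (1.0 - eta * eigs) ** 2 * second_moments + eta ** 2 * sigma2
    return np.array(losses)

# Illustrative spectrum and schedules (assumptions).
eigs = np.logspace(-3, 0, 50)
constant = lambda t, T: 0.5
cosine   = lambda t, T: 0.5 * 0.5 * (1 + np.cos(np.pi * t / T))
linear   = lambda t, T: 0.5 * (1 - t / T)

for name, sched in [("constant", constant), ("cosine", cosine), ("linear", linear)]:
    curve = expected_risk_curve(eigs, sigma2=1e-3, lr_schedule=sched, steps=2000)
    print(f"{name:8s} final expected loss: {curve[-1]:.5f}")
```

Decaying schedules shrink the noise-injection term η²σ² late in training, which is the mechanism this simple model captures.
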
Quanquan Gu (@quanquangu) 's Twitter Profile Photo

Can’t make it to #ICML2025 this year. People ask why I’m so obsessed with pretraining and scaling. Simple: the AGI era is here. I refuse to be irrelevant.

Quanquan Gu (@quanquangu) 's Twitter Profile Photo

Moving from theory to large models has not made me popular along the way. Some people can't adjust to your change; some don't want you to actually succeed. Losers and haters make noise. Builders build. Feel the AGI!

Volkan Cevher (@cevherlions) 's Twitter Profile Photo

Excited to give a tutorial with Leena C Vankadara on Training Neural Networks at Any Scale (TRAINS) at the ICML Conference at 13:30 (West Ballroom A). Our slides can be found here: go.epfl.ch/ICML25TRAINS. Please join us.

PapersAnon (@papers_anon) 's Twitter Profile Photo

Mixture of Raytraced Experts

Stacked MoE architecture that can dynamically select sequences of experts, producing computational graphs of variable width and depth. Allows predictions with increasing accuracy as the computation cycles through the experts' sequence.

Links below
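
For a rough feel of the general idea (this is a toy sketch, not the published architecture; the paper's actual routing mechanism may differ), here is a hypothetical module that routes an input through a sequence of experts chosen step by step, emitting an intermediate prediction after each step so accuracy can improve as more experts fire. The expert count, router, residual update, and readout below are all assumptions.

```python
# Toy sketch of sequential expert routing with anytime predictions.
# NOT the paper's architecture; all design choices here are assumptions.
import torch
import torch.nn as nn

class SequentialExpertRouter(nn.Module):
    def __init__(self, dim, num_experts=4, max_steps=3):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
            for _ in range(num_experts)
        )
        self.router = nn.Linear(dim, num_experts)  # picks the next expert from the current state
        self.head = nn.Linear(dim, 1)              # readout usable after every routing step
        self.max_steps = max_steps

    def forward(self, x):
        preds = []
        h = x
        for _ in range(self.max_steps):
            weights = torch.softmax(self.router(h), dim=-1)       # soft choice of the next expert
            expert_out = torch.stack([e(h) for e in self.experts], dim=-1)
            h = h + (expert_out * weights.unsqueeze(-2)).sum(-1)   # residual update from chosen experts
            preds.append(self.head(h))                             # "anytime" prediction at this depth
        return preds  # later entries have passed through a longer expert sequence

x = torch.randn(8, 16)
model = SequentialExpertRouter(dim=16)
print([p.shape for p in model(x)])  # one prediction per routing step
```
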
Quanquan Gu (@quanquangu) 's Twitter Profile Photo

μP plays a central role in scaling large language models, known for hyperparameter transfer & stability. But don’t overlook its feature learning power. 📈
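
For reference, a rough sketch of the commonly cited μP heuristics for hidden (matrix-like) weights under Adam: tune at a base width, then rescale the init and learning rate with width so the optimum transfers. The base width, base learning rate, and helper below are illustrative assumptions; the full rules for embedding/readout layers are in the μP papers and the `mup` package.

```python
# Rough sketch of muP-style width scaling for hidden weights under Adam.
# Numbers are illustrative assumptions, not tuned values.
import math

def mup_hidden_hparams(width, base_width=256, base_lr=1e-3):
    """Commonly cited muP rules for hidden weights with Adam:
    init std ~ 1/sqrt(fan_in), learning rate ~ 1/width relative to a tuned base."""
    init_std = 1.0 / math.sqrt(width)
    lr = base_lr * base_width / width
    return init_std, lr

for width in [256, 1024, 4096]:
    std, lr = mup_hidden_hparams(width)
    print(f"width={width:5d}  init_std={std:.4f}  adam_lr={lr:.2e}")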

Qingyue Zhao (@zhaoqingyue) 's Twitter Profile Photo

Drop by our poster in Ballroom A, West Building, to check out our cute analysis techniques and the rich set of future directions opened up by our work.

Francis Bach (@bachfrancis) 's Twitter Profile Photo

Tired of lengthy computations to derive scaling laws? This post is made for you: discover the sharpness of the z-transform! francisbach.com/z-transform/
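
For context, the central object in this kind of approach is the one-sided z-transform (generating function) of a sequence, whose singular behavior as z → 1⁻ encodes the sequence's power-law tail. A hedged summary of that standard correspondence (notation is generic, not necessarily the post's):

```latex
% One-sided z-transform (generating function) of a sequence (a_n):
\[
  A(z) \;=\; \sum_{n \ge 0} a_n z^n, \qquad |z| < 1 .
\]
% Abelian-type correspondence turning a power-law tail into a singularity at z = 1:
\[
  a_n \;\sim\; \frac{c}{n^{\alpha}} \ (0 < \alpha < 1)
  \quad \Longrightarrow \quad
  A(z) \;\sim\; c\,\Gamma(1-\alpha)\,(1-z)^{\alpha - 1}
  \quad \text{as } z \to 1^- .
\]
```
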

Mikhail Parakhin (@mparakhin) 's Twitter Profile Photo

Since nobody asked :-), here is my list of papers not to be missed from ICML:
1) Dion: distributed orthonormalized updates (well, technically not at ICML, but everyone's talking about it).
2) MARS: Unleashing the Power of Variance Reduction for Training Large Models
3) ...

Sheryl Hsu (@sherylhsu02) 's Twitter Profile Photo

The model solves these problems without tools like Lean or coding; it just uses natural language, and it only has 4.5 hours. We see the model reason at a very high level: trying out different strategies, making observations from examples, and testing hypotheses.

Aryo Pradipta Gema (@aryopg) 's Twitter Profile Photo

New Anthropic Research: “Inverse Scaling in Test-Time Compute”

We found cases where longer reasoning leads to lower accuracy.
Our findings suggest that naïve scaling of test-time compute may inadvertently reinforce problematic reasoning patterns.

🧵