Shane Bergsma (@shanebergsma)'s Twitter Profile
Shane Bergsma

@shanebergsma

Man bites data

ID: 482109376

Website: https://sites.google.com/site/shaneabergsma/ | Joined: 03-02-2012 14:53:26

206 Tweets

269 Followers

410 Following

Artificial Analysis (@artificialanlys)

Cerebras has set a new record for AI inference speed, serving Llama 3.1 8B at 1,850 output tokens/s and 70B at 446 output tokens/s.

@CerebrasSystems has just launched their API inference offering, powered by their custom wafer-scale AI accelerator chips.

Cerebras Inference is
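
For context, a throughput number like this can be sanity-checked against any OpenAI-compatible chat endpoint. The sketch below assumes such an endpoint for Cerebras Inference; the base URL, model id, and environment-variable name are illustrative assumptions, not confirmed values.

```python
# Rough throughput check against an OpenAI-compatible chat endpoint.
# ASSUMPTIONS: the base URL, model id, and env-var name below are illustrative
# guesses, not confirmed values -- check the Cerebras Inference docs.
import os
import time

from openai import OpenAI  # pip install openai

client = OpenAI(
    base_url="https://api.cerebras.ai/v1",    # assumed endpoint
    api_key=os.environ["CEREBRAS_API_KEY"],   # assumed env-var name
)

start = time.perf_counter()
resp = client.chat.completions.create(
    model="llama3.1-8b",                      # assumed model id
    messages=[{"role": "user",
               "content": "Explain wafer-scale inference in one paragraph."}],
    max_tokens=512,
)
elapsed = time.perf_counter() - start

out_tokens = resp.usage.completion_tokens
print(f"{out_tokens} output tokens in {elapsed:.2f}s "
      f"~= {out_tokens / elapsed:.0f} tokens/s (includes network and prefill time)")
```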
Cerebras (@cerebrassystems)

It’s #ICLR2025 week, and we’re proud to share that Team Cerebras will be presenting their paper "Straight to Zero: Why Linearly Decaying the Learning Rate to Zero Works Best for LLMs" at ICLR 2025! Big congrats to the authors; your work is powering the future of AI compute.
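
The schedule the paper's title describes, decaying the learning rate linearly all the way to zero, can be sketched in a few lines of PyTorch. The warmup length, peak LR, and step counts below are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch of a warmup + linear-decay-to-zero LR schedule in PyTorch.
# ASSUMPTIONS: warmup length, peak LR, and step counts are illustrative only.
import torch
from torch.optim.lr_scheduler import LambdaLR

def linear_to_zero(step: int, total_steps: int, warmup_steps: int) -> float:
    """Multiplier on the peak LR: linear warmup, then linear decay to exactly 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    remaining = total_steps - step
    return max(0.0, remaining / max(1, total_steps - warmup_steps))

model = torch.nn.Linear(512, 512)   # stand-in for an LLM
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
total_steps, warmup_steps = 10_000, 500
sched = LambdaLR(opt, lambda s: linear_to_zero(s, total_steps, warmup_steps))

for step in range(total_steps):
    opt.step()    # placeholder: a real loop would do forward/backward first
    sched.step()

print(sched.get_last_lr())   # -> [0.0] at the final step
```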

Nolan Dey (@deynolan)

(1/7) Cerebras Paper drop: arxiv.org/abs/2505.01618

TLDR: We introduce CompleteP, which offers depth-wise hyperparameter (HP) transfer (Left), FLOP savings when training deep models (Middle), and a larger range of compute-efficient width/depth ratios (Right).  🧵 👇
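
As a rough illustration of the depth-wise reparameterization idea behind results like this, the sketch below down-weights each residual branch by 1/L^alpha so that models of different depth keep comparable residual-stream dynamics, which is what makes tuning hyperparameters on a shallow model and reusing them on a deep one plausible. This is a toy sketch only, not the full CompleteP recipe (which also prescribes initialization, learning-rate, and weight-decay rules); the module shapes and the alpha=1.0 default are assumptions.

```python
# Toy sketch of depth-dependent residual scaling (NOT the full CompleteP recipe).
import torch
import torch.nn as nn

class ScaledResidualBlock(nn.Module):
    def __init__(self, d_model: int, depth: int, alpha: float = 1.0):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        # Down-weight each residual branch as the network gets deeper, so the
        # total update to the residual stream stays comparable across depths.
        self.branch_scale = 1.0 / (depth ** alpha)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.branch_scale * self.mlp(self.norm(x))

# Two models of different depth with (roughly) matched residual-stream dynamics.
shallow = nn.Sequential(*[ScaledResidualBlock(256, depth=4) for _ in range(4)])
deep = nn.Sequential(*[ScaledResidualBlock(256, depth=32) for _ in range(32)])
```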
Daria Soboleva (@dmsobol)

Major finding #1: λ=0.1 used in the majority of LLMs is suboptimal!

Our work shows that optimal weight decay (λ) scales linearly with batch size. Most researchers use the same λ regardless of batch size, leaving performance on the table.
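
The rule stated here, optimal λ scaling linearly with batch size, amounts to a one-line rescaling once λ has been tuned at some reference batch size. The reference values below are placeholders for illustration, not recommendations from the paper.

```python
# Minimal sketch of "scale weight decay linearly with batch size".
# ASSUMPTIONS: the reference batch size and reference lambda are placeholders;
# tune them at a small reference scale first.
def scaled_weight_decay(batch_size: int,
                        ref_batch_size: int = 256,
                        ref_lambda: float = 0.1) -> float:
    """Linearly rescale weight decay when the batch size changes."""
    return ref_lambda * (batch_size / ref_batch_size)

for bs in (256, 1024, 4096):
    print(f"batch_size={bs:5d} -> weight_decay={scaled_weight_decay(bs):.3f}")
```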
Shikai Qiu (@shikaiqiu)

Beautiful work on pretraining science using scaling collapse to precisely predict, debug, and tune LLM training from small-scale and partial runs. So many insights on going beyond µP!

Atli Kosson (@atlikosson)

The Maximal Update Parameterization (µP) allows LR transfer from small to large models, saving costly tuning. But why is independent weight decay (IWD) essential for it to work?

We find µP stabilizes early training (like an LR warmup), but IWD takes over in the long term! 🧵
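
To make the IWD distinction concrete: in PyTorch's AdamW the decay term is multiplied by the current learning rate, whereas "independent" weight decay removes the same fraction of the weight each step regardless of the LR. The sketch below contrasts the two update forms on a plain gradient step (Adam's moment machinery omitted); the hyperparameter values are illustrative assumptions.

```python
# Coupled (AdamW-style) vs. independent weight decay, shown on a bare gradient step.
# ASSUMPTIONS: lr and wd values are illustrative; Adam's moments are omitted for clarity.
import torch

def coupled_decay_step(w: torch.Tensor, grad: torch.Tensor, lr: float, wd: float) -> torch.Tensor:
    # torch.optim.AdamW-style: decay is multiplied by the learning rate,
    # so an LR warmup/decay schedule also rescales the effective decay.
    return w - lr * grad - lr * wd * w

def independent_decay_step(w: torch.Tensor, grad: torch.Tensor, lr: float, wd: float) -> torch.Tensor:
    # Independent weight decay (IWD): the same fraction of the weight is removed
    # each step, regardless of the current LR.
    return w - lr * grad - wd * w

w, g = torch.randn(8), torch.randn(8)
print(coupled_decay_step(w, g, lr=1e-3, wd=0.1))
print(independent_decay_step(w, g, lr=1e-3, wd=1e-4))
```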
Shane Bergsma (@shanebergsma)

Wikipedia (one of the supreme achievements of humanity) doesn't get enough love, so just let me say, "thank you, Wikipedia."

@vyedin.bsky.social (@vyedin)

In an effort to foster a more cooperative spirit between different parts of my code, I no longer pass *arguments* to a function. Instead when one function calls another, it passes along some *gentle feedback*.