Nolan Dey (@deynolan) 's Twitter Profile
Nolan Dey

@deynolan

Research Scientist @ Cerebras Systems

ID: 1508916574132002817

Link: https://ndey96.github.io · Joined: 29-03-2022 21:18:55

32 Tweets

435 Followers

36 Following

Cerebras (@cerebrassystems) 's Twitter Profile Photo

🚨 New podcast: how we made Cerebras-GPT with Nolan Dey and Quentin Anthony. A deep look at what it's like to train on Cerebras and the tradeoffs between compute-optimal and inference-optimal training. youtube.com/watch?v=QmmNgi…

Cerebras (@cerebrassystems) 's Twitter Profile Photo

📣 New dataset drop! Introducing SlimPajama-627B: the largest extensively deduplicated, multi-corpora, open-source dataset for training large language models. 🧵cerebras.net/blog/slimpajam…

Cerebras (@cerebrassystems) 's Twitter Profile Photo

Cerebras BTLM-3B-8K model crosses 1M downloads🤯 It's the #1 ranked 3B language model on Hugging Face! A big thanks to all the devs out there building on top of open source models 🙌

Cerebras (@cerebrassystems) 's Twitter Profile Photo

We just dropped the BTLM-3B-8K paper on arXiv! It distills our recipe for training SOTA LLMs:
- Extensively deduplicated dataset (SlimPajama)
- Hyperparameter search using muP
- Variable sequence length training + ALiBi
- Aggressive LR decay
arxiv.org/abs/2309.11568
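For readers unfamiliar with ALiBi, here is a minimal illustrative sketch (not the BTLM implementation; the function name alibi_bias is hypothetical): each head penalizes attention logits in proportion to the query-key distance, with a per-head slope.

    import torch

    def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
        # Per-head slopes form a geometric sequence (the schedule from the ALiBi
        # paper, simplified here for power-of-two head counts).
        slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
        # Relative distance i - j between query position i and key position j.
        pos = torch.arange(seq_len)
        dist = (pos[:, None] - pos[None, :]).clamp(min=0)   # (seq_len, seq_len)
        return -slopes[:, None, None] * dist                # (n_heads, seq_len, seq_len)

    # Usage: add to the raw attention logits before the causal mask and softmax, e.g.
    # scores = scores + alibi_bias(n_heads, seq_len)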
Cerebras (@cerebrassystems) 's Twitter Profile Photo

📣 Paper drop: Position Interpolation Improves ALiBi Extrapolation
We found a simple method to 2x the context length of models that use ALiBi. This lets models like BTLM-3B-8K and MPT-7B-8K run high-quality inference at up to 16K context with no additional fine-tuning. 👇
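A hedged sketch of the idea, assuming "position interpolation" here means rescaling the relative distances (equivalently, the slopes) fed to ALiBi by train_length / eval_length so that inference beyond the trained context reuses bias magnitudes seen during training; the exact method and the helper name below are illustrative, not the paper's.

    def interpolated_alibi_scale(train_len: int, eval_len: int) -> float:
        # Illustration only: when eval_len exceeds train_len, shrink the relative
        # distances so the bias range at inference matches the range seen in training.
        return min(1.0, train_len / eval_len)

    # e.g. bias = alibi_bias(n_heads, eval_len) * interpolated_alibi_scale(8192, 16384)
    #      (alibi_bias as in the sketch above)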

Vithu Thangarasa (@vithursant19) 's Twitter Profile Photo

Successfully ported Andrej Karpathy's nanoGPT to the new Apple MLX framework, possibly enabling quick prototyping of training GPT-style models on Mac GPUs. Check out the project: github.com/vithursant/nan…. Got a new M3 Pro and wanted to learn about MLX over the holidays lol.

Cerebras (@cerebrassystems) 's Twitter Profile Photo

(1/n) Paper drop: arxiv.org/abs/2405.15743 TLDR: We introduce the sparse maximal update parameterization (SμPar), which ensures optimal HPs remain the same for any width or sparsity level. This dramatically reduces HP tuning costs, allowing SμPar to achieve superior losses. 🧵 👇

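As a rough illustration of the mechanism (not the paper's exact parameterization, and the function name is hypothetical), one can think of SμPar as extending μP's width corrections with a density factor, so a sparse layer is scaled by its effective fan-in:

    def supar_like_scaling(base_lr, base_init_var, base_width, width, density):
        # Illustration only: treat a sparse hidden layer's effective fan-in as
        # density * width and apply muP-style corrections with that factor, so
        # HPs tuned at the dense base width transfer across width AND sparsity.
        effective_fan_in = density * width
        lr = base_lr * base_width / effective_fan_in
        init_var = base_init_var * base_width / effective_fan_in
        return lr, init_var

    # e.g. a 4x wider layer at 75% sparsity keeps the same effective fan-in, so it
    # reuses the base LR and init variance unchanged:
    # supar_like_scaling(1e-3, 1/256, base_width=256, width=1024, density=0.25)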
Davis Blalock (@davisblalock) 's Twitter Profile Photo

So, uh, it turns out that 30+ years of neural net sparsity research have been confounded by optimal hyperparameters varying with sparsity level...

EleutherAI (@aieleuther) 's Twitter Profile Photo

🎉We're excited to announce our joint work with @Cerebras on a new guide to Maximal Update Parameterization (μP) and μTransfer!🎉 This practitioner's guide (and implementation) aims to make μP more accessible and easier to implement for the broader training community. 🧵
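A minimal sketch of the kind of rule the guide covers, assuming Adam-style training: relative to a small tuned base width, hidden weight matrices get their learning rate scaled down with width, while vector-like parameters keep the base value. The helper below is illustrative; the full recipe also rescales initializers and output multipliers.

    import torch

    def mup_param_groups(model, base_lr, base_width, width):
        # Rough muP-style sketch for Adam(W). Assumption: matrix-like hidden
        # weights get lr * base_width / width; embeddings, biases, and norm
        # gains keep the base lr. See the guide for the complete set of rules.
        hidden, vector_like = [], []
        for name, p in model.named_parameters():
            (hidden if p.ndim >= 2 and "embed" not in name else vector_like).append(p)
        return [
            {"params": vector_like, "lr": base_lr},
            {"params": hidden, "lr": base_lr * base_width / width},
        ]

    # optimizer = torch.optim.AdamW(mup_param_groups(model, 2e-3, base_width=256, width=2048))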

Nolan Dey (@deynolan) 's Twitter Profile Photo

Published "Neuron-based explanations of neural networks sacrifice completeness and interpretability" in TMLR 2025! TL;DR: The most important principal components provide more complete and interpretable explanations than the most important neurons. ndey96.github.io/neuron-explana…

Shane Bergsma (@shanebergsma) 's Twitter Profile Photo

Power Lines paper now out: arxiv.org/abs/2505.13738 TL;DR - we identify how AdamW's weight decay should scale with batch size, dataset size, and model size in LLM pre-training. We also investigate the scaling of both "optimal" and "critical" batch size.
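A hedged sketch of the bookkeeping behind such scaling rules, assuming the common "timescale" view of decoupled weight decay (τ ≈ 1/(lr · wd) optimizer steps); the function name and numbers below are illustrative, and the paper's actual scaling laws are stated precisely there.

    def adamw_timescale_fraction(lr, weight_decay, batch_tokens, dataset_tokens):
        # Assumed definition for illustration: AdamW forgets old updates on a
        # timescale of roughly 1 / (lr * weight_decay) optimizer steps, i.e.
        # batch_tokens / (lr * weight_decay) tokens. Expressing that as a
        # fraction of the full run ties weight decay to batch and dataset size.
        tau_steps = 1.0 / (lr * weight_decay)
        return tau_steps * batch_tokens / dataset_tokens

    # e.g. holding this fraction fixed while doubling batch_tokens requires
    # doubling weight_decay (with lr and dataset_tokens held fixed).
    print(adamw_timescale_fraction(lr=3e-4, weight_decay=0.1, batch_tokens=2**21, dataset_tokens=300e9))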

Shane Bergsma (@shanebergsma) 's Twitter Profile Photo

(1/4) Cerebras Hot off the presses 🔥📄arxiv.org/abs/2509.25087 If you're spending $1B to train an LLM, you need to know it’s on track—every step of the way. With optimal AdamW τ + fixed TPP, loss curves collapse to a universal path → an early-warning signal for training.

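A hedged sketch of how such an early-warning check could look in practice (not the paper's method; the reference curve and tolerance are hypothetical): compare a run's loss, at a given fraction of training, against a reference "universal" curve and flag deviations.

    import numpy as np

    def on_track(ref_frac, ref_loss, cur_frac, cur_loss, tol=0.01):
        # ref_frac/ref_loss: reference loss curve indexed by fraction of training
        # completed (the "universal path"). Flags the current run if its loss
        # deviates from the interpolated reference by more than tol (relative).
        expected = np.interp(cur_frac, np.asarray(ref_frac), np.asarray(ref_loss))
        return abs(cur_loss - expected) / expected <= tol

    # e.g. on_track(ref_frac=[0.1, 0.5, 1.0], ref_loss=[3.2, 2.6, 2.3],
    #               cur_frac=0.5, cur_loss=2.7)  -> False at tol=0.01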
Shane Bergsma (@shanebergsma) 's Twitter Profile Photo

Another new preprint from @CerebrasSystems 🚨📄- this time on training *re-evaluation* curves (TRECs) for data curriculums in LLMs. Everyone sticks high-quality data at the end of training… we show the sweet spot is often earlier — and we can predict it. arxiv.org/abs/2509.25380
