Nolan Dey (@deynolan) 's Twitter Profile
Nolan Dey

@deynolan

Research Scientist @ Cerebras Systems

ID: 1508916574132002817

Link: https://ndey96.github.io · Joined: 29-03-2022 21:18:55

32 Tweets

435 Followers

36 Following

Cerebras (@cerebrassystems) 's Twitter Profile Photo

🚨 New podcast: how we made Cerebras-GPT with Nolan Dey and Quentin Anthony. A deep look at what it's like to train on Cerebras and the tradeoffs between compute-optimal and inference-optimal training. youtube.com/watch?v=QmmNgi…

Cerebras (@cerebrassystems) 's Twitter Profile Photo

📣 New dataset drop! Introducing SlimPajama-627B: the largest extensively deduplicated, multi-corpora, open-source dataset for training large language models. 🧵cerebras.net/blog/slimpajam…

Cerebras (@cerebrassystems) 's Twitter Profile Photo

Cerebras BTLM-3B-8K model crosses 1M downloads🤯 It's the #1 ranked 3B language model on Hugging Face! A big thanks to all the devs out there building on top of open source models 🙌

Cerebras (@cerebrassystems) 's Twitter Profile Photo

We just dropped the BTLM-3B-8K paper on arXiv! It distills our recipe for training SOTA LLMs:
- Extensively deduplicated dataset (SlimPajama)
- Hyperparameter search using muP
- Variable sequence length training + ALiBi
- Aggressive LR decay
arxiv.org/abs/2309.11568
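For readers unfamiliar with ALiBi, here is a minimal illustrative sketch (not the BTLM implementation; the function name alibi_bias is hypothetical): each head penalizes attention logits in proportion to the query-key distance, with a per-head slope.

    import torch

    def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
        # Per-head slopes form a geometric sequence (the schedule from the ALiBi
        # paper, simplified here for power-of-two head counts).
        slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
        # Relative distance i - j between query position i and key position j.
        pos = torch.arange(seq_len)
        dist = (pos[:, None] - pos[None, :]).clamp(min=0)   # (seq_len, seq_len)
        return -slopes[:, None, None] * dist                # (n_heads, seq_len, seq_len)

    # Usage: add to the raw attention logits before the causal mask and softmax, e.g.
    # scores = scores + alibi_bias(n_heads, seq_len)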
Cerebras (@cerebrassystems) 's Twitter Profile Photo

📣 Paper drop: Position Interpolation Improves ALiBi Extrapolation
We found a simple method to 2x the context length of models that use ALiBi. This lets models like BTLM-3B-8K and MPT-7B-8K run high-quality inference at up to 16K context with no additional fine-tuning. 👇
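A hedged sketch of the idea, assuming "position interpolation" here means rescaling the relative distances (equivalently, the slopes) fed to ALiBi by train_length / eval_length so that inference beyond the trained context reuses bias magnitudes seen during training; the exact method and the helper name below are illustrative, not the paper's.

    def interpolated_alibi_scale(train_len: int, eval_len: int) -> float:
        # Illustration only: when eval_len exceeds train_len, shrink the relative
        # distances so the bias range at inference matches the range seen in training.
        return min(1.0, train_len / eval_len)

    # e.g. bias = alibi_bias(n_heads, eval_len) * interpolated_alibi_scale(8192, 16384)
    #      (alibi_bias as in the sketch above)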

Vithu Thangarasa (@vithursant19) 's Twitter Profile Photo

Successfully ported Andrej Karpathy's nanoGPT to the new Apple MLX framework, possibly enabling quick prototyping of training GPT-style models on Mac GPUs. Check out the project: github.com/vithursant/nan…. Got a new M3 Pro and wanted to learn about MLX over the holidays lol.

Cerebras (@cerebrassystems) 's Twitter Profile Photo

(1/n) Paper drop: arxiv.org/abs/2405.15743 TLDR: We introduce the sparse maximal update parameterization (SμPar), which ensures optimal HPs remain the same for any width or sparsity level. This dramatically reduces HP tuning costs, allowing SμPar to achieve superior losses. 🧵 👇

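As a rough illustration of the mechanism (not the paper's exact parameterization, and the function name is hypothetical), one can think of SμPar as extending μP's width corrections with a density factor, so a sparse layer is scaled by its effective fan-in:

    def supar_like_scaling(base_lr, base_init_var, base_width, width, density):
        # Illustration only: treat a sparse hidden layer's effective fan-in as
        # density * width and apply muP-style corrections with that factor, so
        # HPs tuned at the dense base width transfer across width AND sparsity.
        effective_fan_in = density * width
        lr = base_lr * base_width / effective_fan_in
        init_var = base_init_var * base_width / effective_fan_in
        return lr, init_var

    # e.g. a 4x wider layer at 75% sparsity keeps the same effective fan-in, so it
    # reuses the base LR and init variance unchanged:
    # supar_like_scaling(1e-3, 1/256, base_width=256, width=1024, density=0.25)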
Davis Blalock (@davisblalock) 's Twitter Profile Photo

So, uh, it turns out that 30+ years of neural net sparsity research have been confounded by optimal hyperparameters varying with sparsity level...

EleutherAI (@aieleuther) 's Twitter Profile Photo

🎉We're excited to announce our joint work with @Cerebras on a new guide to Maximal Update Parameterization (μP) and μTransfer!🎉 This practitioner's guide (and implementation) aims to make μP more accessible and easier to implement for the broader training community. 🧵
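A minimal sketch of the kind of rule the guide covers, assuming Adam-style training: relative to a small tuned base width, hidden weight matrices get their learning rate scaled down with width, while vector-like parameters keep the base value. The helper below is illustrative; the full recipe also rescales initializers and output multipliers.

    import torch

    def mup_param_groups(model, base_lr, base_width, width):
        # Rough muP-style sketch for Adam(W). Assumption: matrix-like hidden
        # weights get lr * base_width / width; embeddings, biases, and norm
        # gains keep the base lr. See the guide for the complete set of rules.
        hidden, vector_like = [], []
        for name, p in model.named_parameters():
            (hidden if p.ndim >= 2 and "embed" not in name else vector_like).append(p)
        return [
            {"params": vector_like, "lr": base_lr},
            {"params": hidden, "lr": base_lr * base_width / width},
        ]

    # optimizer = torch.optim.AdamW(mup_param_groups(model, 2e-3, base_width=256, width=2048))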

Nolan Dey (@deynolan) 's Twitter Profile Photo

Published "Neuron-based explanations of neural networks sacrifice completeness and interpretability" in TMLR 2025! TL;DR: The most important principal components provide more complete and interpretable explanations than the most important neurons. ndey96.github.io/neuron-explana…

Shane Bergsma (@shanebergsma) 's Twitter Profile Photo

Power Lines paper now out: arxiv.org/abs/2505.13738 TL;DR - we identify how AdamW's weight decay should scale with batch size, dataset size, and model size in LLM pre-training. We also investigate the scaling of both "optimal" and "critical" batch size.
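A hedged sketch of the bookkeeping behind such scaling rules, assuming the common "timescale" view of decoupled weight decay (τ ≈ 1/(lr · wd) optimizer steps); the function name and numbers below are illustrative, and the paper's actual scaling laws are stated precisely there.

    def adamw_timescale_fraction(lr, weight_decay, batch_tokens, dataset_tokens):
        # Assumed definition for illustration: AdamW forgets old updates on a
        # timescale of roughly 1 / (lr * weight_decay) optimizer steps, i.e.
        # batch_tokens / (lr * weight_decay) tokens. Expressing that as a
        # fraction of the full run ties weight decay to batch and dataset size.
        tau_steps = 1.0 / (lr * weight_decay)
        return tau_steps * batch_tokens / dataset_tokens

    # e.g. holding this fraction fixed while doubling batch_tokens requires
    # doubling weight_decay (with lr and dataset_tokens held fixed).
    print(adamw_timescale_fraction(lr=3e-4, weight_decay=0.1, batch_tokens=2**21, dataset_tokens=300e9))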

Shane Bergsma (@shanebergsma) 's Twitter Profile Photo

(1/4) Cerebras Hot off the presses 🔥📄arxiv.org/abs/2509.25087 If you're spending $1B to train an LLM, you need to know it’s on track—every step of the way. With optimal AdamW τ + fixed TPP, loss curves collapse to a universal path → an early-warning signal for training.

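A hedged sketch of how such an early-warning check could look in practice (not the paper's method; the reference curve and tolerance are hypothetical): compare a run's loss, at a given fraction of training, against a reference "universal" curve and flag deviations.

    import numpy as np

    def on_track(ref_frac, ref_loss, cur_frac, cur_loss, tol=0.01):
        # ref_frac/ref_loss: reference loss curve indexed by fraction of training
        # completed (the "universal path"). Flags the current run if its loss
        # deviates from the interpolated reference by more than tol (relative).
        expected = np.interp(cur_frac, np.asarray(ref_frac), np.asarray(ref_loss))
        return abs(cur_loss - expected) / expected <= tol

    # e.g. on_track(ref_frac=[0.1, 0.5, 1.0], ref_loss=[3.2, 2.6, 2.3],
    #               cur_frac=0.5, cur_loss=2.7)  -> False at tol=0.01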
Shane Bergsma (@shanebergsma) 's Twitter Profile Photo

Another new preprint from @CerebrasSystems 🚨📄- this time on training *re-evaluation* curves (TRECs) for data curriculums in LLMs. Everyone sticks high-quality data at the end of training… we show the sweet spot is often earlier — and we can predict it. arxiv.org/abs/2509.25380
