Neil Band (@neilbband)'s Twitter Profile
Neil Band

@neilbband

PhD student @StanfordAILab @StanfordNLP @Stanford advised by Tatsunori Hashimoto and Tengyu Ma.
Prev: @OATML_Oxford @CompSciOxford

ID: 1311274045581791234

Website: http://nband.github.io · Joined: 30-09-2020 11:58:11

139 Tweets

698 Followers

535 Following

Aran Komatsuzaki (@arankomatsuzaki)'s Twitter Profile Photo

Synthetic continued pretraining

Proposes to bridge the sample-inefficiency of pretraining with synthetic continued pretraining: continued pretraining on a large corpus synthetically generated from a small domain-specific corpus

arxiv.org/abs/2409.07431
Zitong Yang (@zitongyang0)'s Twitter Profile Photo

Grab your favorite preprint of the week: how can you put its knowledge in your LM’s parameters? Continued pretraining (CPT) works well with >10B tokens, but the preprint is <10K.

Synthetic CPT downscales CPT to such small, targeted domains.

📜: arxiv.org/abs/2409.07431

🧵👇
Neil Band (@neilbband)'s Twitter Profile Photo

Really enjoyed working on Synthetic Continued Pretraining with Zitong Yang*, Shuangping Li, Emmanuel Candes, and Tatsunori Hashimoto! A simple approach to continue pretraining (CPT) on as little as ~1M tokens: Small specialized corpus -> synthesize a large, diverse corpus -> CPT
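The recipe in the tweet (small specialized corpus -> synthesize a large, diverse corpus -> CPT) can be sketched as a minimal data-flow, with stub functions standing in for the real components. Note this is only an illustration of the pipeline shape: `synthesize` and `continued_pretrain` are hypothetical stand-ins, not the paper's API, and the paper's actual synthesizer (EntiGraph) generates text about relations between entities extracted from the source documents rather than simple paraphrases.

```python
# Hedged sketch of the synthetic CPT data-flow:
# small corpus -> synthesize diverse rewrites -> continue pretraining.
# Both functions below are stubs for illustration only.

def synthesize(documents, variants_per_doc=3):
    """Stand-in for an LLM-based synthesizer: emit several
    synthetic views per source document (the real method,
    EntiGraph, writes about entity relations instead)."""
    synthetic = []
    for doc in documents:
        for i in range(variants_per_doc):
            synthetic.append(f"[synthetic view {i}] {doc}")
    return synthetic

def continued_pretrain(model_state, corpus):
    """Stand-in for a CPT run: just records how many whitespace
    tokens the model would train on."""
    tokens = sum(len(text.split()) for text in corpus)
    model_state["cpt_tokens"] = model_state.get("cpt_tokens", 0) + tokens
    return model_state

small_corpus = ["Preprint A: ...", "Preprint B: ..."]
large_corpus = synthesize(small_corpus, variants_per_doc=100)
model = continued_pretrain({}, large_corpus)
```

The point of the amplification step is that CPT needs far more tokens than a niche corpus contains, so the synthesizer trades a small amount of source text for a much larger, more diverse training set.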

Jon Saad-Falcon (@jonsaadfalcon)'s Twitter Profile Photo

What is the best way to spend your inference compute budget to create LLM systems greater than the sum of their parts? In our latest paper, we present Archon, an architecture search framework for inference-time techniques! Archon is enabled by inference-time architecture search
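The search idea described above can be illustrated with a toy version: enumerate small pipelines built from inference-time operators, evaluate each on a dev set, and keep the best. Everything below is an illustrative assumption, not Archon's actual operators, scoring rule, or API.

```python
import itertools

# Toy inference-time architecture search in the spirit of Archon.
# Operators transform a list of candidate answers; a pipeline is a
# sequence of operator names. All operators and the metric are stubs.

def op_dedupe(candidates):
    """Drop repeated candidate answers, preserving order."""
    return list(dict.fromkeys(candidates))

def op_rank(candidates):
    """Stub ranker: prefer longer answers."""
    return sorted(candidates, key=len, reverse=True)

def op_fuse(candidates):
    """Stub fuser: merge the top two candidates into one answer."""
    if len(candidates) < 2:
        return candidates
    return [candidates[0] + " | " + candidates[1]] + candidates[2:]

OPS = {"dedupe": op_dedupe, "rank": op_rank, "fuse": op_fuse}

def run_pipeline(pipeline, candidates):
    for name in pipeline:
        candidates = OPS[name](candidates)
    return candidates[0]  # final answer = top remaining candidate

def dev_score(answer, reference):
    """Stub metric: word overlap with a reference answer."""
    a, r = set(answer.split()), set(reference.split())
    return len(a & r) / max(len(r), 1)

def architecture_search(dev_set, max_depth=2):
    """Exhaustively try operator sequences up to max_depth; return the best."""
    best, best_score = (), -1.0
    for depth in range(1, max_depth + 1):
        for pipeline in itertools.product(OPS, repeat=depth):
            s = sum(dev_score(run_pipeline(pipeline, cands), ref)
                    for cands, ref in dev_set) / len(dev_set)
            if s > best_score:
                best, best_score = pipeline, s
    return best, best_score
```

The real system searches over far richer components (ensembling, ranking, fusion across multiple LLM calls) under a compute budget, but the search-then-select structure is the same.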

Neil Band (@neilbband)'s Twitter Profile Photo

A CoPilot for tutors + a rigorous study with 1,800 students showing it actually helps them master concepts better 🔥 Awesome work by Rose and team!

Tanishq Kumar (@tanishqkumar07)'s Twitter Profile Photo

[1/7] New paper alert! Heard about the BitNet hype or that Llama-3 is harder to quantize? Our new work studies both! We formulate scaling laws for precision, across both pre and post-training arxiv.org/pdf/2411.04330. TLDR;

- Models become harder to post-train quantize as they
Nicole Meister (@nicole__meister)'s Twitter Profile Photo

Prior work has used LLMs to simulate survey responses, yet their ability to match the distribution of views remains uncertain.

Our new paper [arxiv.org/pdf/2411.05403] introduces a benchmark to evaluate how distributionally aligned LLMs are with human opinions.

🧵
Etash Guha @ ICLR (@etash_guha)'s Twitter Profile Photo

I’m super happy to have co-led the Evalchemy 🧪 team! I’ve been personally wanting a simple and fast framework for running a host of common post-training evals 📚 for a while now and this tool has streamlined a lot of my research. We hope that it helps you out too!

Teddi Worledge (@teddiworledge)'s Twitter Profile Photo

🧵LLMs are great at synthesizing info, but unreliable at citing sources. Search engines are the opposite. What lies between them?

Our new paper runs human evals on 7 systems across the ✨extractive-abstractive spectrum✨ for utility, citation quality, time-to-verify, & fluency!
Jaya Gupta (@jayagup10)'s Twitter Profile Photo

Services as Software Part 3 is here. And it's not about opportunity, it's about extinction. Why? Services as Software isn't just a $4.6T opportunity. It's the end of software as we know it. The $300B enterprise software industry is facing its asteroid moment. SAP, Oracle,

Niklas Muennighoff (@muennighoff)'s Twitter Profile Photo

DeepSeek r1 is exciting but misses OpenAI’s test-time scaling plot and needs lots of data.

We introduce s1 reproducing o1-preview scaling & performance with just 1K samples & a simple test-time intervention.

📜arxiv.org/abs/2501.19393
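The "simple test-time intervention" the tweet mentions is what the s1 paper calls budget forcing: cap the model's thinking trace at a token budget, or extend it by suppressing the end-of-thinking delimiter and appending "Wait" so the model keeps reasoning. A minimal sketch of that control loop, with a stub standing in for the LM's generate call:

```python
# Hedged sketch of budget forcing (per the s1 paper). `generate` is a
# stub: a real implementation would call an LM that emits a thinking
# trace ending in an end-of-thinking delimiter.

def generate(prompt, max_tokens):
    """Stub LM: always stops after at most 5 reasoning tokens."""
    return ["step"] * min(max_tokens, 5)

def budget_force(prompt, min_tokens, max_tokens):
    trace = generate(prompt, max_tokens)
    # Extend: if the model stops thinking too early, append "Wait"
    # (instead of the end-of-thinking delimiter) and keep generating.
    while len(trace) < min_tokens:
        trace.append("Wait")
        trace += generate(prompt, max_tokens - len(trace))
    # Cap: truncate the trace at the test-time compute budget.
    return trace[:max_tokens]
```

Varying `min_tokens`/`max_tokens` is what lets a single model trace out a test-time scaling curve: more forced thinking, more compute, (ideally) better answers.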
Tengyu Ma (@tengyuma)'s Twitter Profile Photo

RL + CoT works great for DeepSeek-R1 & o1, but:

1️⃣ Linear-in-log scaling in train & test-time compute
2️⃣ Likely bounded by difficulty of training problems

Meet STP—a self-play algorithm that conjectures & proves indefinitely, scaling better! 🧠⚡🧵🧵

arxiv.org/abs/2502.00212
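The conjecture-and-prove loop described above can be sketched abstractly: a conjecturer proposes new statements, a prover attempts them, and successfully proved conjectures feed back into the pool the system trains on, so difficulty is no longer capped by a fixed problem set. Every component below is a stub for illustration; the real system operates on formal theorem statements and machine-checked proofs.

```python
import random

# Toy self-play loop in the spirit of STP: conjecture, attempt a proof,
# and recycle proved conjectures as new training statements. The
# conjecturer and prover here are trivial stand-ins.

def conjecture(statements, rng):
    """Stub conjecturer: perturb a known statement."""
    return rng.choice(statements) + "'"

def attempt_proof(statement, rng):
    """Stub prover: succeeds on a coin flip."""
    return rng.random() < 0.5

def self_play(seed_statements, rounds=100, seed=0):
    rng = random.Random(seed)
    training_data = list(seed_statements)
    for _ in range(rounds):
        c = conjecture(training_data, rng)
        if attempt_proof(c, rng):
            training_data.append(c)  # proved conjectures become new data
    return training_data
```

Because the pool grows with every proved conjecture, the loop can in principle run indefinitely, which is the property the tweet contrasts with RL on a fixed set of training problems.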
Tanishq Mathew Abraham, Ph.D. (@iscienceluvr)'s Twitter Profile Photo

Reasoning to Learn from Latent Thoughts

"Motivated by how humans apply deliberate thinking to learn from limited data, we train an LM to infer (or “decompress”) latent thoughts underlying the highly compressed observed data. These synthesized latent thoughts augment the raw
Yangjun Ruan (@yangjunr)'s Twitter Profile Photo

New paper on synthetic pretraining!

We show LMs can synthesize their own thoughts for more data-efficient pretraining, bootstrapping their capabilities on limited, task-agnostic data. We call this new paradigm “reasoning to learn”.
arxiv.org/abs/2503.18866

Here’s how it works🧵
Etash Guha @ ICLR (@etash_guha)'s Twitter Profile Photo

Turns out, it’s possible to outperform DeepSeekR1-32B with only SFT on open data and no RL: Announcing OpenThinker2-32B and OpenThinker2-7B. We also release the data, OpenThoughts2-1M, curated by selecting quality instructions from diverse sources. 🧵 (1/n)

Stanford NLP Group (@stanfordnlp)'s Twitter Profile Photo

Want to learn the engineering details of building state-of-the-art Large Language Models (LLMs)? Not finding much info in OpenAI’s non-technical reports? Percy Liang and Tatsunori Hashimoto are here to help with CS336: Language Modeling from Scratch, now rolling out to YouTube.

Tatsunori Hashimoto (@tatsu_hashimoto)'s Twitter Profile Photo

I think CS336 has one of the best LLM problem sets of any AI/LM class thanks to our incredible TAs (Nelson Liu, Gabriel Poesia, Marcel Rød, Neil Band, Rohith Kuditipudi). We're making it so you can do it all at home, and it's one of the best ways to learn LLMs deeply.

Zitong Yang (@zitongyang0)'s Twitter Profile Photo

Synthetic Continued Pretraining (arxiv.org/pdf/2409.07431) has been accepted as an Oral Presentation at #ICLR2025!

We tackle the challenge of data-efficient language model pretraining: how to teach an LM the knowledge of small, niche corpora, such as the latest arXiv preprints.
Simon Guo 🦝 (@simonguozirui)'s Twitter Profile Photo

Designed some graphics for Stanford CS336 (Language Modeling from Scratch) by Percy Liang, Tatsunori Hashimoto, Marcel Rød, Neil Band, Rohith Kuditipudi

Covering four assignments 📚 that teach you how to 🧑‍🍳 cook an LLM from scratch:
- Build and Train a Tokenizer 🔤
- Write Triton kernels for
Ryan Marten (@ryanmart3n)'s Twitter Profile Photo

Announcing OpenThinker3-7B, the new SOTA open-data 7B reasoning model: improving over DeepSeek-R1-Distill-Qwen-7B by 33% on average over code, science, and math evals.

We also release our dataset, OpenThoughts3-1.2M, which is the best open reasoning dataset across all data