Neil Band (@neilbband)'s Twitter Profile
Neil Band

@neilbband

PhD student @StanfordAILab @StanfordNLP @Stanford advised by Tatsunori Hashimoto and Tengyu Ma.
Prev: @OATML_Oxford @CompSciOxford

ID: 1311274045581791234

Website: http://nband.github.io · Joined: 30-09-2020 11:58:11

139 Tweets

698 Followers

535 Following

Aran Komatsuzaki (@arankomatsuzaki)'s Twitter Profile Photo

Synthetic continued pretraining

Proposes to bridge the sample-inefficiency of pretraining with synthetic continued pretraining: continued pretraining on a large corpus synthetically generated from a small domain-specific corpus

arxiv.org/abs/2409.07431
Zitong Yang (@zitongyang0)'s Twitter Profile Photo

Grab your favorite preprint of the week: how can you put its knowledge in your LM’s parameters? Continued pretraining (CPT) works well with >10B tokens, but the preprint is <10K.

Synthetic CPT downscales CPT to such small, targeted domains.

📜: arxiv.org/abs/2409.07431

🧵👇
Neil Band (@neilbband)'s Twitter Profile Photo

Really enjoyed working on Synthetic Continued Pretraining with Zitong Yang*, Shuangping Li, Emmanuel Candes, and Tatsunori Hashimoto! A simple approach to continue pretraining (CPT) on as little as ~1M tokens: Small specialized corpus -> synthesize a large, diverse corpus -> CPT
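The recipe in the tweet (small specialized corpus -> synthesize a large, diverse corpus -> CPT) can be sketched as a minimal data-flow, with stub functions standing in for the real components. Note this is only an illustration of the pipeline shape: `synthesize` and `continued_pretrain` are hypothetical stand-ins, not the paper's API, and the paper's actual synthesizer (EntiGraph) generates text about relations between entities extracted from the source documents rather than simple paraphrases.

```python
# Hedged sketch of the synthetic CPT data-flow:
# small corpus -> synthesize diverse rewrites -> continue pretraining.
# Both functions below are stubs for illustration only.

def synthesize(documents, variants_per_doc=3):
    """Stand-in for an LLM-based synthesizer: emit several
    synthetic views per source document (the real method,
    EntiGraph, writes about entity relations instead)."""
    synthetic = []
    for doc in documents:
        for i in range(variants_per_doc):
            synthetic.append(f"[synthetic view {i}] {doc}")
    return synthetic

def continued_pretrain(model_state, corpus):
    """Stand-in for a CPT run: just records how many whitespace
    tokens the model would train on."""
    tokens = sum(len(text.split()) for text in corpus)
    model_state["cpt_tokens"] = model_state.get("cpt_tokens", 0) + tokens
    return model_state

small_corpus = ["Preprint A: ...", "Preprint B: ..."]
large_corpus = synthesize(small_corpus, variants_per_doc=100)
model = continued_pretrain({}, large_corpus)
```

The point of the amplification step is that CPT needs far more tokens than a niche corpus contains, so the synthesizer trades a small amount of source text for a much larger, more diverse training set.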

Jon Saad-Falcon (@jonsaadfalcon)'s Twitter Profile Photo

What is the best way to spend your inference compute budget to create LLM systems greater than the sum of their parts? In our latest paper, we present Archon, an architecture search framework for inference-time techniques! Archon is enabled by inference-time architecture search
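The search idea described above can be illustrated with a toy version: enumerate small pipelines built from inference-time operators, evaluate each on a dev set, and keep the best. Everything below is an illustrative assumption, not Archon's actual operators, scoring rule, or API.

```python
import itertools

# Toy inference-time architecture search in the spirit of Archon.
# Operators transform a list of candidate answers; a pipeline is a
# sequence of operator names. All operators and the metric are stubs.

def op_dedupe(candidates):
    """Drop repeated candidate answers, preserving order."""
    return list(dict.fromkeys(candidates))

def op_rank(candidates):
    """Stub ranker: prefer longer answers."""
    return sorted(candidates, key=len, reverse=True)

def op_fuse(candidates):
    """Stub fuser: merge the top two candidates into one answer."""
    if len(candidates) < 2:
        return candidates
    return [candidates[0] + " | " + candidates[1]] + candidates[2:]

OPS = {"dedupe": op_dedupe, "rank": op_rank, "fuse": op_fuse}

def run_pipeline(pipeline, candidates):
    for name in pipeline:
        candidates = OPS[name](candidates)
    return candidates[0]  # final answer = top remaining candidate

def dev_score(answer, reference):
    """Stub metric: word overlap with a reference answer."""
    a, r = set(answer.split()), set(reference.split())
    return len(a & r) / max(len(r), 1)

def architecture_search(dev_set, max_depth=2):
    """Exhaustively try operator sequences up to max_depth; return the best."""
    best, best_score = (), -1.0
    for depth in range(1, max_depth + 1):
        for pipeline in itertools.product(OPS, repeat=depth):
            s = sum(dev_score(run_pipeline(pipeline, cands), ref)
                    for cands, ref in dev_set) / len(dev_set)
            if s > best_score:
                best, best_score = pipeline, s
    return best, best_score
```

The real system searches over far richer components (ensembling, ranking, fusion across multiple LLM calls) under a compute budget, but the search-then-select structure is the same.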

Neil Band (@neilbband)'s Twitter Profile Photo

A CoPilot for tutors + a rigorous study with 1,800 students showing it actually helps them master concepts better 🔥 Awesome work by Rose and team!

Tanishq Kumar (@tanishqkumar07)'s Twitter Profile Photo

[1/7] New paper alert! Heard about the BitNet hype or that Llama-3 is harder to quantize? Our new work studies both! We formulate scaling laws for precision, across both pre and post-training arxiv.org/pdf/2411.04330. TLDR;

- Models become harder to post-train quantize as they
Nicole Meister (@nicole__meister)'s Twitter Profile Photo

Prior work has used LLMs to simulate survey responses, yet their ability to match the distribution of views remains uncertain.

Our new paper [arxiv.org/pdf/2411.05403] introduces a benchmark to evaluate how distributionally aligned LLMs are with human opinions.

🧵
Etash Guha @ ICLR (@etash_guha)'s Twitter Profile Photo

I’m super happy to have co-led the Evalchemy 🧪 team! I’ve been personally wanting a simple and fast framework for running a host of common post-training evals 📚 for a while now and this tool has streamlined a lot of my research. We hope that it helps you out too!

Teddi Worledge (@teddiworledge)'s Twitter Profile Photo

🧵LLMs are great at synthesizing info, but unreliable at citing sources. Search engines are the opposite. What lies between them?

Our new paper runs human evals on 7 systems across the ✨extractive-abstractive spectrum✨ for utility, citation quality, time-to-verify, & fluency!
Jaya Gupta (@jayagup10)'s Twitter Profile Photo

Services as Software Part 3 is here. And it's not about opportunity, it's about extinction. Why? Services as Software isn't just a $4.6T opportunity. It's the end of software as we know it. The $300B enterprise software industry is facing its asteroid moment. SAP, Oracle,

Niklas Muennighoff (@muennighoff)'s Twitter Profile Photo

DeepSeek r1 is exciting but misses OpenAI’s test-time scaling plot and needs lots of data.

We introduce s1 reproducing o1-preview scaling & performance with just 1K samples & a simple test-time intervention.

📜arxiv.org/abs/2501.19393
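The "simple test-time intervention" the tweet mentions is what the s1 paper calls budget forcing: cap the model's thinking trace at a token budget, or extend it by suppressing the end-of-thinking delimiter and appending "Wait" so the model keeps reasoning. A minimal sketch of that control loop, with a stub standing in for the LM's generate call:

```python
# Hedged sketch of budget forcing (per the s1 paper). `generate` is a
# stub: a real implementation would call an LM that emits a thinking
# trace ending in an end-of-thinking delimiter.

def generate(prompt, max_tokens):
    """Stub LM: always stops after at most 5 reasoning tokens."""
    return ["step"] * min(max_tokens, 5)

def budget_force(prompt, min_tokens, max_tokens):
    trace = generate(prompt, max_tokens)
    # Extend: if the model stops thinking too early, append "Wait"
    # (instead of the end-of-thinking delimiter) and keep generating.
    while len(trace) < min_tokens:
        trace.append("Wait")
        trace += generate(prompt, max_tokens - len(trace))
    # Cap: truncate the trace at the test-time compute budget.
    return trace[:max_tokens]
```

Varying `min_tokens`/`max_tokens` is what lets a single model trace out a test-time scaling curve: more forced thinking, more compute, (ideally) better answers.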
Tengyu Ma (@tengyuma)'s Twitter Profile Photo

RL + CoT works great for DeepSeek-R1 & o1, but:

1️⃣ Linear-in-log scaling in train & test-time compute
2️⃣ Likely bounded by difficulty of training problems

Meet STP—a self-play algorithm that conjectures & proves indefinitely, scaling better! 🧠⚡🧵🧵

arxiv.org/abs/2502.00212
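The conjecture-and-prove loop described above can be sketched abstractly: a conjecturer proposes new statements, a prover attempts them, and successfully proved conjectures feed back into the pool the system trains on, so difficulty is no longer capped by a fixed problem set. Every component below is a stub for illustration; the real system operates on formal theorem statements and machine-checked proofs.

```python
import random

# Toy self-play loop in the spirit of STP: conjecture, attempt a proof,
# and recycle proved conjectures as new training statements. The
# conjecturer and prover here are trivial stand-ins.

def conjecture(statements, rng):
    """Stub conjecturer: perturb a known statement."""
    return rng.choice(statements) + "'"

def attempt_proof(statement, rng):
    """Stub prover: succeeds on a coin flip."""
    return rng.random() < 0.5

def self_play(seed_statements, rounds=100, seed=0):
    rng = random.Random(seed)
    training_data = list(seed_statements)
    for _ in range(rounds):
        c = conjecture(training_data, rng)
        if attempt_proof(c, rng):
            training_data.append(c)  # proved conjectures become new data
    return training_data
```

Because the pool grows with every proved conjecture, the loop can in principle run indefinitely, which is the property the tweet contrasts with RL on a fixed set of training problems.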
Tanishq Mathew Abraham, Ph.D. (@iscienceluvr)'s Twitter Profile Photo

Reasoning to Learn from Latent Thoughts

"Motivated by how humans apply deliberate thinking to learn from limited data, we train an LM to infer (or “decompress”) latent thoughts underlying the highly compressed observed data. These synthesized latent thoughts augment the raw
Yangjun Ruan (@yangjunr)'s Twitter Profile Photo

New paper on synthetic pretraining!

We show LMs can synthesize their own thoughts for more data-efficient pretraining, bootstrapping their capabilities on limited, task-agnostic data. We call this new paradigm “reasoning to learn”.
arxiv.org/abs/2503.18866

Here’s how it works🧵
Etash Guha @ ICLR (@etash_guha)'s Twitter Profile Photo

Turns out, it’s possible to outperform DeepSeekR1-32B with only SFT on open data and no RL: Announcing OpenThinker2-32B and OpenThinker2-7B. We also release the data, OpenThoughts2-1M, curated by selecting quality instructions from diverse sources. 🧵 (1/n)

Stanford NLP Group (@stanfordnlp)'s Twitter Profile Photo

Want to learn the engineering details of building state-of-the-art Large Language Models (LLMs)? Not finding much info in OpenAI’s non-technical reports? Percy Liang and Tatsunori Hashimoto are here to help with CS336: Language Modeling from Scratch, now rolling out to YouTube.

Tatsunori Hashimoto (@tatsu_hashimoto)'s Twitter Profile Photo

I think CS336 has one of the best LLM problem sets of any AI/LM class thanks to our incredible TAs (Nelson Liu, Gabriel Poesia, Marcel Rød, Neil Band, Rohith Kuditipudi). We're making it so you can do it all at home, and it's one of the best ways to learn LLMs deeply.

Zitong Yang (@zitongyang0)'s Twitter Profile Photo

Synthetic Continued Pretraining (arxiv.org/pdf/2409.07431) has been accepted as an Oral Presentation at #ICLR2025!

We tackle the challenge of data-efficient language model pretraining: how to teach an LM the knowledge of small, niche corpora, such as the latest arXiv preprints.
Simon Guo 🦝 (@simonguozirui)'s Twitter Profile Photo

Designed some graphics for Stanford CS336 (Language Modeling from Scratch) by Percy Liang, Tatsunori Hashimoto, Marcel Rød, Neil Band, Rohith Kuditipudi

Covering four assignments 📚 that teach you how to 🧑‍🍳 cook an LLM from scratch:
- Build and Train a Tokenizer 🔤
- Write Triton kernels for
Ryan Marten (@ryanmart3n)'s Twitter Profile Photo

Announcing OpenThinker3-7B, the new SOTA open-data 7B reasoning model: improving over DeepSeek-R1-Distill-Qwen-7B by 33% on average over code, science, and math evals.

We also release our dataset, OpenThoughts3-1.2M, which is the best open reasoning dataset across all data