Brian Bartoldson (@bartoldson)'s Twitter Profile
Brian Bartoldson

@bartoldson

ML researcher

ID: 783700916742524929

Link: http://brianbartoldson.wordpress.com
Joined: 05-10-2016 16:10:33

225 Tweets

290 Followers

461 Following

Aleksander Madry (@aleks_madry)

GSM8K has been a cornerstone benchmark for LLMs, but performance seemed stuck around 95%. Why?

Turns out, the benchmark itself was noisy. We fixed that, and found that it significantly affects evals.

Introducing GSM8K-Platinum!

w/ Eddie Vendrow, Josh Vendrow, Sara Beery
𝚐𝔪𝟾𝚡𝚡𝟾 (@gm8xx8)

Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training

TBA is a scalable RL system for LLM post-training that uses off-policy data and replay buffers with Trajectory Balance. It decouples training from search, improving speed
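For intuition, here is a minimal sketch in Python of the two pieces that description names: a replay buffer filled asynchronously by searcher processes and drained off-policy by the trainer, plus a Trajectory Balance objective. The names (ReplayBuffer, tb_loss) and the exact loss arguments are illustrative assumptions, not the paper's actual code.

    import random
    import torch

    class ReplayBuffer:
        """Searcher processes add trajectories asynchronously; the trainer
        samples from the buffer off-policy, so exploration (generation)
        and learning (gradient steps) no longer wait on each other."""
        def __init__(self, capacity=10_000):
            self.items, self.capacity = [], capacity

        def add(self, item):  # item: (prompt, response, log_reward)
            self.items.append(item)
            self.items = self.items[-self.capacity:]

        def sample(self, k):
            return random.sample(self.items, min(k, len(self.items)))

    def tb_loss(log_z, policy_logprob, log_reward):
        """Trajectory Balance: (log Z + log P_F(response) - log R)^2, where
        log Z is a learned scalar and policy_logprob is the current policy's
        log-probability of the replayed response (torch tensors assumed)."""
        return ((log_z + policy_logprob - log_reward) ** 2).mean()

Because Trajectory Balance only needs the current policy's log-probability of a stored response, stale samples from the buffer remain usable, which is what lets training proceed off-policy while search runs elsewhere.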
Bhavya Kailkhura (@bkailkhu)

At Lawrence Livermore National Laboratory, we are using AI to:
⚛️ Solve nuclear fusion
🧪 Discover critical materials
🧠 Red-team vulnerabilities

All to push science forward and protect national security 🌎

Post-training LLMs at scale can unlock these advances. But even with El Capitan—the world’s
Yangjun Ruan (@yangjunr)

New paper on synthetic pretraining!

We show LMs can synthesize their own thoughts for more data-efficient pretraining, bootstrapping their capabilities on limited, task-agnostic data. We call this new paradigm “reasoning to learn”.
arxiv.org/abs/2503.18866

Here’s how it works🧵
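As a rough sketch of what "synthesizing thoughts" could look like in practice (assuming a generic generate() text-completion call; the prompt wording and helper names here are illustrative, not the paper's):

    def synthesize_thought(model, chunk):
        # Ask the model to articulate the latent reasoning behind a raw
        # text chunk before that chunk is used for pretraining.
        prompt = f"Think step by step about what this text says and why:\n{chunk}"
        return model.generate(prompt)

    def augment_corpus(model, chunks):
        # Each example becomes (synthetic thought + original chunk), so a
        # pass over the limited, task-agnostic corpus also trains on the
        # reasoning that explains it, bootstrapping data efficiency.
        return [synthesize_thought(model, c) + "\n" + c for c in chunks]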
fly51fly (@fly51fly)

[LG] Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training
B R. Bartoldson, S Venkatraman, J Diffenderfer, M Jain... [Lawrence Livermore National Laboratory & Mila] (2025)
arxiv.org/abs/2503.18929
Cihang Xie (@cihangxie)

🚨 Interested in adopting Large Reasoning Models (LRMs) but concerned about safety risks? 🚨

Meet STAR-1 🌟 – A compact, high-quality safety dataset (just 1K samples!) boosting LRMs' safety by 40% with only a minimal (~1.1%) reasoning drop! 🚀

How we built STAR-1 in just 3
Cihang Xie (@cihangxie)

🚨Concerned about visual jailbreaking attacks holding back Vision-Language Model (VLM) deployment?

🌟 Excited to announce our latest research: Double Visual Defense!

TL;DR: We introduce ΔCLIP and Δ²LLaVA — the first to reconcile robust adversarial performance with
Brian Bartoldson (@bartoldson)

🚀 The code for LLM post-training with TBA is now available! Try out Trajectory Balance with Asynchrony via github.com/bbartoldson/TBA. x.com/bartoldson/sta…

Johan S. Obando 👍🏽 (@johanobandoc)

🥳Come chat with Brian Bartoldson and Moksh Jain at our TBA poster at the #ICLR25 workshop on Open Science for Foundation Models (SCI-FM). The workshop will be held in EXPO Hall 4 #5 on Monday, April 28th.
EleutherAI (@aieleuther)

Can you train a performant language model without using unlicensed text?

We are thrilled to announce the Common Pile v0.1, an 8TB dataset of openly licensed and public domain text. We train 7B models for 1T and 2T tokens and match the performance of similar models like LLaMA 1&2
Brian Bartoldson (@bartoldson)

Here's a free/gift link to the Washington Post article about training LLMs on openly licensed text: wapo.st/3T94IfQ. x.com/AiEleuther/sta…

Infini-AI-Lab (@infiniailab)

🚀 Excited to introduce our latest work GRESO: a method that identifies and skips millions of uninformative prompts before rollout and achieves up to 2.0x wall-clock time speedup in training.

More rollouts lead to better model performance, but they’re also a major bottleneck in
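One way to picture the pre-rollout filter described above (a hedged sketch; the reward-variance heuristic and names are assumptions for illustration, not GRESO's exact selection rule):

    from collections import defaultdict
    import statistics

    reward_history = defaultdict(list)  # prompt_id -> rewards from past rollouts

    def is_informative(prompt_id, window=8, eps=1e-6):
        # If recent rollouts of a prompt all earned (near-)identical rewards,
        # a group-relative advantage is ~0 and the rollout teaches nothing.
        recent = reward_history[prompt_id][-window:]
        if len(recent) < window:
            return True  # too little evidence yet; keep this prompt
        return statistics.pvariance(recent) > eps

    def select_prompts(prompt_ids):
        # Filtering happens before generation, the expensive step, which is
        # where the wall-clock savings would come from.
        return [p for p in prompt_ids if is_informative(p)]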