Josh Vendrow (@josh_vendrow)'s Twitter Profile
Josh Vendrow

@josh_vendrow

CS PhD Student @ MIT, interested in safe and reliable machine learning. Advised by @aleks_madry.

ID: 1599432101648125952

Website: http://joshvendrow.com · Joined: 04-12-2022 15:55:27

91 Tweets

283 Followers

329 Following

Josh Vendrow (@josh_vendrow)'s Twitter Profile Photo

I think it's a great signal that adding reasoning helps a lot with reliability on relatively simple math. E.g., on MMLU High School Math, thinking more takes Claude 3.7 Sonnet from 14 errors to 3. Most reasoning-model signal so far has been on competition (very hard) math.

Eddie Vendrow (@edwardvendrow)'s Twitter Profile Photo

Very excited to share *GSM8K-Platinum*, a revised version of the GSM8K test set! If you’re using GSM8K, I highly recommend you switch to GSM8K-Platinum! We built it as a drop-in replacement for the GSM8K test set. Check it out: huggingface.co/datasets/madry…

Josh Vendrow (@josh_vendrow)'s Twitter Profile Photo

Excited to share GSM8K-Platinum! We show that benchmark quality is crucial for understanding LLM performance. Benchmark noise matters, a lot! 👇 Great effort led by Eddie Vendrow. We also built it as a direct drop-in for the GSM8K test set! 📝 Blog post: gradientscience.org/gsm8k-platinum

Logan Engstrom (@logan_engstrom)'s Twitter Profile Photo

Want state-of-the-art data curation, data poisoning & more? Just do gradient descent! w/ Andrew Ilyas, Ben Chen, Axel Feldmann, Billy Moses, Aleksander Madry: we show how to optimize final model loss wrt any continuous variable. Key idea: Metagradients (grads through model training)
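The thread's key idea, backpropagating the *final* loss through the whole (unrolled) training run, can be sketched on a toy problem. Everything below is my own illustrative toy (a scalar model with hand-rolled reverse mode), not the authors' implementation; the continuous variable here is a per-example data weight:

```python
# Metagradient toy: T steps of SGD as one differentiable computation.
# Model: scalar w; weighted train loss sum_i a_i * (w - c_i)^2.

def train(a, cs, lr=0.05, T=50):
    """Run T SGD steps; return the full trajectory [w_0, ..., w_T]."""
    w, traj = 0.0, [0.0]
    for _ in range(T):
        g = sum(2.0 * ai * (w - ci) for ai, ci in zip(a, cs))
        w -= lr * g
        traj.append(w)
    return traj

def final_loss(a, cs, lr=0.05, T=50):
    """Unweighted eval loss of the trained model (what we differentiate)."""
    w_T = train(a, cs, lr, T)[-1]
    return sum((w_T - ci) ** 2 for ci in cs)

def metagrad(a, cs, lr=0.05, T=50):
    """dL_final / da_j via reverse mode through the unrolled updates."""
    traj = train(a, cs, lr, T)
    w_T = traj[-1]
    gbar = sum(2.0 * (w_T - ci) for ci in cs)  # dL/dw_T
    meta = [0.0] * len(a)
    # Update rule: w_{t+1} = w_t - lr * 2 * sum_i a_i * (w_t - c_i)
    for t in reversed(range(T)):
        w_t = traj[t]
        for j, cj in enumerate(cs):
            meta[j] += gbar * (-2.0 * lr * (w_t - cj))  # direct a_j term
        gbar *= 1.0 - 2.0 * lr * sum(a)                 # through w_t
    return meta
```

Storing the trajectory is what makes the backward pass possible; the gradients agree with finite differences on `final_loss`.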

Josh Vendrow (@josh_vendrow)'s Twitter Profile Photo

Really amazing work! I'm excited that the community is moving toward not just coming up with "harder" benchmarks, but evaluations that actually tell us how and why models fail (and when it's actually the benchmark's fault).

Sarah Schwettmann (@cogconfluence)'s Twitter Profile Photo

I’m excited about Docent. It invites a world where AI evals & deployment decisions look less like: “did we pass threshold X” and more like: “how close did we come? how would changes in the agent or its environment have changed the outcome? ...did anything weird happen?”

idan shenfeld (@idanshenfeld)'s Twitter Profile Photo

The next frontier for AI shouldn’t just be models that are generally helpful. It should be models that are helpful for you! Our new paper shows how to personalize LLMs: efficiently, scalably, and without retraining. Meet PReF (arxiv.org/abs/2503.06358) (1/n)

Josh Vendrow (@josh_vendrow)'s Twitter Profile Photo

I've been trying out the ARC-AGI v2 puzzles; this one is absolutely nuts: arcprize.org/play?task=0934… (puzzle #10 on v1, which is also #6 on v2)

Qubitum (@qubitium)'s Twitter Profile Photo

GSM8K-Platinum is now merged into LM-Eval. Platinum is a cleaned-up version of the GSM8K dataset that removes errors in the dataset itself. Scores will be ~10% higher than on the original GSM8K. github.com/EleutherAI/lm-…

Josh Vendrow (@josh_vendrow)'s Twitter Profile Photo

2024: Let’s benchmark LLMs using 15 problems (AIME). 2025: Let’s benchmark LLMs using 6 problems (USAMO). I have a worrying scaling law for LLM benchmarks.
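One way to make the worry concrete (my back-of-envelope sketch, not from the tweet): the sampling error of an accuracy estimate from n independent problems is roughly sqrt(p(1-p)/n), so a 6-problem benchmark has a standard error of about 20 points at p = 0.5:

```python
import math

def stderr(p, n):
    """Standard error of an accuracy estimate from n independent problems."""
    return math.sqrt(p * (1 - p) / n)

# At 50% accuracy: 6 problems (USAMO-sized), 15 (AIME-sized), ~1000-problem benchmark.
for n in (6, 15, 1000):
    print(n, round(stderr(0.5, n), 3))  # 6 -> 0.204, 15 -> 0.129, 1000 -> 0.016
```

With n = 6, a single extra solved problem moves the score by ~17 points, which is why rankings on such benchmarks are so unstable.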

Eddie Vendrow (@edwardvendrow)'s Twitter Profile Photo

🚀 We ran the latest frontier LLMs on PlatinumBench, our benchmark designed to measure reliability on cleaned benchmarks. New models include GPT-4.5, Gemini 2.5 Pro, Llama 4, and Grok 3. Turns out, these models still make simple errors! Thread below👇 (1/4)

Josh Vendrow (@josh_vendrow)'s Twitter Profile Photo

Any time I use o3, its reasoning keeps saying "Joshua wants to...", "Joshua has asked me to...", etc. Is anyone else getting something like this?

Ben Cohen-Wang (@bcohenwang)'s Twitter Profile Photo

It can be helpful to pinpoint the in-context information that a language model uses when generating content (is it using provided documents? or its own intermediate thoughts?). We present Attribution with Attention (AT2), a method for doing so efficiently and reliably! (1/8)
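As a rough illustration of the idea (my own toy aggregation, not the actual AT2 method or its API): given attention weights from each generated token back over the context, one can total the attention mass that falls on each context span, e.g. "provided documents" vs. "intermediate thoughts", to see which source the model leaned on:

```python
def attribute(attn, spans):
    """Score context spans by average attention mass from generated tokens.

    attn:  one row per generated token; each row is a distribution
           (sums to 1) over context-token positions.
    spans: dict mapping a span name to its (start, end) positions.
    """
    n_gen = len(attn)
    scores = {}
    for name, (s, e) in spans.items():
        total = sum(sum(row[s:e]) for row in attn)
        scores[name] = total / n_gen
    return scores

# Two generated tokens attending over a 4-token context:
attn = [[0.7, 0.1, 0.1, 0.1],
        [0.2, 0.2, 0.3, 0.3]]
print(attribute(attn, {"document": (0, 2), "thoughts": (2, 4)}))
```

Raw attention is only a crude proxy for attribution; the point of a dedicated method is precisely to do better than this kind of naive averaging, while staying cheap.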

Victor Butoi (@ion_barrel)'s Twitter Profile Photo

This summer I'm venturing out of the Medical Imaging world into the Self-driving one as an intern at Waymo! I've never lived long-term in SF, and am excited to make new friends 😄

Josh Vendrow (@josh_vendrow)'s Twitter Profile Photo

Eddie and I first discovered this behavior suddenly appearing in the middle of a math problem, using the error viewer we created for Platinum Benchmarks: platinum-bench.csail.mit.edu/inspect?model=… We then realized we could reproduce this behavior directly across models!

Giannis Daras (@giannis_daras)'s Twitter Profile Photo

Announcing Ambient Diffusion Omni, a framework that uses synthetic, low-quality, and out-of-distribution data to improve diffusion models. State-of-the-art ImageNet performance. A strong text-to-image model in just 2 days on 8 GPUs. Filtering ❌ Clever data use ✅

Andrew Ilyas (@andrew_ilyas)'s Twitter Profile Photo

“How will my model behave if I change the training data?” Recent(-ish) work w/ Logan Engstrom: we nearly *perfectly* predict ML model behavior as a function of training data, saturating benchmarks for this problem (called “data attribution”).
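A minimal sketch of what "predicting model behavior as a function of training data" can look like (my own toy, not the paper's code; the target is linear by construction, standing in for actually retraining a model on each subset): regress the model's output on the training-subset inclusion mask, then use the fitted linear map to predict behavior for unseen subsets.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_subsets = 10, 200
true_influence = rng.normal(size=n_train)   # each example's effect on output

# Random training subsets, encoded as 0/1 inclusion masks.
masks = (rng.random((n_subsets, n_train)) < 0.5).astype(float)
# Stand-in for "retrain on this subset and measure the model's output".
outputs = masks @ true_influence + 0.05 * rng.normal(size=n_subsets)

# Fit the linear predictor: output ~ masks @ theta + b
X = np.column_stack([masks, np.ones(n_subsets)])
theta, *_ = np.linalg.lstsq(X, outputs, rcond=None)

preds = X @ theta
r2 = 1 - ((outputs - preds) ** 2).sum() / ((outputs - outputs.mean()) ** 2).sum()
```

The recovered coefficients `theta[:n_train]` approximate each example's influence on the output; with real models the interesting question is how far such a linear surrogate can go, which is what the benchmarks mentioned in the tweet measure.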
