Josh Vendrow (@josh_vendrow)'s Twitter Profile
Josh Vendrow

@josh_vendrow

CS PhD Student @ MIT, interested in safe and reliable machine learning. Advised by @aleks_madry.

ID: 1599432101648125952

Website: http://joshvendrow.com · Joined: 04-12-2022 15:55:27

91 Tweets

283 Followers

329 Following

Josh Vendrow (@josh_vendrow)'s Twitter Profile Photo

I think it's a great signal that adding reasoning helps a lot with reliability on relatively simple math. E.g., on MMLU High School Math, thinking more takes Claude 3.7 Sonnet from 14 errors to 3. Most reasoning-model signal so far has been on competition (very hard) math.

Eddie Vendrow (@edwardvendrow)'s Twitter Profile Photo

Very excited to share *GSM8K-Platinum*, a revised version of the GSM8K test set! If you’re using GSM8K, I highly recommend you switch to GSM8K-Platinum! We built it as a drop-in replacement for the GSM8K test set. Check it out: huggingface.co/datasets/madry…

Josh Vendrow (@josh_vendrow)'s Twitter Profile Photo

Excited to share GSM8K-Platinum! We show that benchmark quality is crucial for understanding LLM performance. Benchmark noise matters, a lot! 👇 Great effort led by Eddie Vendrow. We also built it as a direct drop-in for the GSM8K test set! 📝 Blog post: gradientscience.org/gsm8k-platinum

Logan Engstrom (@logan_engstrom)'s Twitter Profile Photo

Want state-of-the-art data curation, data poisoning & more? Just do gradient descent! w/ Andrew Ilyas, Ben Chen, Axel Feldmann, Billy Moses, Aleksander Madry: we show how to optimize final model loss wrt any continuous variable. Key idea: Metagradients (grads through model training)
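The thread's key idea, backpropagating the *final* loss through the whole (unrolled) training run, can be sketched on a toy problem. Everything below is my own illustrative toy (a scalar model with hand-rolled reverse mode), not the authors' implementation; the continuous variable here is a per-example data weight:

```python
# Metagradient toy: T steps of SGD as one differentiable computation.
# Model: scalar w; weighted train loss sum_i a_i * (w - c_i)^2.

def train(a, cs, lr=0.05, T=50):
    """Run T SGD steps; return the full trajectory [w_0, ..., w_T]."""
    w, traj = 0.0, [0.0]
    for _ in range(T):
        g = sum(2.0 * ai * (w - ci) for ai, ci in zip(a, cs))
        w -= lr * g
        traj.append(w)
    return traj

def final_loss(a, cs, lr=0.05, T=50):
    """Unweighted eval loss of the trained model (what we differentiate)."""
    w_T = train(a, cs, lr, T)[-1]
    return sum((w_T - ci) ** 2 for ci in cs)

def metagrad(a, cs, lr=0.05, T=50):
    """dL_final / da_j via reverse mode through the unrolled updates."""
    traj = train(a, cs, lr, T)
    w_T = traj[-1]
    gbar = sum(2.0 * (w_T - ci) for ci in cs)  # dL/dw_T
    meta = [0.0] * len(a)
    # Update rule: w_{t+1} = w_t - lr * 2 * sum_i a_i * (w_t - c_i)
    for t in reversed(range(T)):
        w_t = traj[t]
        for j, cj in enumerate(cs):
            meta[j] += gbar * (-2.0 * lr * (w_t - cj))  # direct a_j term
        gbar *= 1.0 - 2.0 * lr * sum(a)                 # through w_t
    return meta
```

Storing the trajectory is what makes the backward pass possible; the gradients agree with finite differences on `final_loss`.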

Josh Vendrow (@josh_vendrow)'s Twitter Profile Photo

Really amazing work! I'm excited that the community is moving toward not just coming up with "harder" benchmarks, but evaluations that actually tell us how and why models fail (and when it's actually the benchmark's fault).

Sarah Schwettmann (@cogconfluence)'s Twitter Profile Photo

I’m excited about Docent. It invites a world where AI evals & deployment decisions look less like: “did we pass threshold X” and more like: “how close did we come? how would changes in the agent or its environment have changed the outcome? ...did anything weird happen?”

idan shenfeld (@idanshenfeld)'s Twitter Profile Photo

The next frontier for AI shouldn’t just be models that are generally helpful. It should be models that are helpful for you! Our new paper shows how to personalize LLMs: efficiently, scalably, and without retraining. Meet PReF (arxiv.org/abs/2503.06358) (1/n)

Josh Vendrow (@josh_vendrow)'s Twitter Profile Photo

I've been trying out the ARC-AGI v2 puzzles; this one is absolutely nuts: arcprize.org/play?task=0934… (puzzle #10 on v1, which is also #6 on v2)

Qubitum (@qubitium)'s Twitter Profile Photo

GSM8K-Platinum is now merged into LM-Eval. Platinum is a cleaned-up version of the GSM8K dataset that removes errors in the dataset itself. Scores will be ~10% higher than on the original GSM8K. github.com/EleutherAI/lm-…

Josh Vendrow (@josh_vendrow)'s Twitter Profile Photo

2024: Let’s benchmark LLMs using 15 problems (AIME). 2025: Let’s benchmark LLMs using 6 problems (USAMO). I have a worrying scaling law for LLM benchmarks.
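One way to make the worry concrete (my back-of-envelope sketch, not from the tweet): the sampling error of an accuracy estimate from n independent problems is roughly sqrt(p(1-p)/n), so a 6-problem benchmark has a standard error of about 20 points at p = 0.5:

```python
import math

def stderr(p, n):
    """Standard error of an accuracy estimate from n independent problems."""
    return math.sqrt(p * (1 - p) / n)

# At 50% accuracy: 6 problems (USAMO-sized), 15 (AIME-sized), ~1000-problem benchmark.
for n in (6, 15, 1000):
    print(n, round(stderr(0.5, n), 3))  # 6 -> 0.204, 15 -> 0.129, 1000 -> 0.016
```

With n = 6, a single extra solved problem moves the score by ~17 points, which is why rankings on such benchmarks are so unstable.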

Eddie Vendrow (@edwardvendrow)'s Twitter Profile Photo

🚀 We ran the latest frontier LLMs on PlatinumBench, our benchmark designed to measure reliability on cleaned benchmarks. New models include GPT-4.5, Gemini 2.5 Pro, Llama 4, and Grok 3. Turns out, these models still make simple errors! Thread below👇 (1/4)

Josh Vendrow (@josh_vendrow)'s Twitter Profile Photo

Any time I use o3, its reasoning keeps saying "Joshua wants to...", "Joshua has asked me to...", etc. Is anyone else getting something like this?

Ben Cohen-Wang (@bcohenwang)'s Twitter Profile Photo

It can be helpful to pinpoint the in-context information that a language model uses when generating content (is it using provided documents? or its own intermediate thoughts?). We present Attribution with Attention (AT2), a method for doing so efficiently and reliably! (1/8)
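As a rough illustration of the idea (my own toy aggregation, not the actual AT2 method or its API): given attention weights from each generated token back over the context, one can total the attention mass that falls on each context span, e.g. "provided documents" vs. "intermediate thoughts", to see which source the model leaned on:

```python
def attribute(attn, spans):
    """Score context spans by average attention mass from generated tokens.

    attn:  one row per generated token; each row is a distribution
           (sums to 1) over context-token positions.
    spans: dict mapping a span name to its (start, end) positions.
    """
    n_gen = len(attn)
    scores = {}
    for name, (s, e) in spans.items():
        total = sum(sum(row[s:e]) for row in attn)
        scores[name] = total / n_gen
    return scores

# Two generated tokens attending over a 4-token context:
attn = [[0.7, 0.1, 0.1, 0.1],
        [0.2, 0.2, 0.3, 0.3]]
print(attribute(attn, {"document": (0, 2), "thoughts": (2, 4)}))
```

Raw attention is only a crude proxy for attribution; the point of a dedicated method is precisely to do better than this kind of naive averaging, while staying cheap.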

Victor Butoi (@ion_barrel)'s Twitter Profile Photo

This summer I'm venturing out of the Medical Imaging world into the Self-driving one as an intern at Waymo! I've never lived long-term in SF, and am excited to make new friends 😄

Josh Vendrow (@josh_vendrow)'s Twitter Profile Photo

Eddie and I first discovered this behavior suddenly appearing in the middle of a math problem, using the error viewer we created for Platinum Benchmarks: platinum-bench.csail.mit.edu/inspect?model=… We then realized we could reproduce this behavior directly across models!

Giannis Daras (@giannis_daras)'s Twitter Profile Photo

Announcing Ambient Diffusion Omni, a framework that uses synthetic, low-quality, and out-of-distribution data to improve diffusion models. State-of-the-art ImageNet performance. A strong text-to-image model in just 2 days on 8 GPUs. Filtering ❌ Clever data use ✅

Andrew Ilyas (@andrew_ilyas)'s Twitter Profile Photo

“How will my model behave if I change the training data?” Recent(-ish) work w/ Logan Engstrom: we nearly *perfectly* predict ML model behavior as a function of training data, saturating benchmarks for this problem (called “data attribution”).
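A minimal sketch of what "predicting model behavior as a function of training data" can look like (my own toy, not the paper's code; the target is linear by construction, standing in for actually retraining a model on each subset): regress the model's output on the training-subset inclusion mask, then use the fitted linear map to predict behavior for unseen subsets.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_subsets = 10, 200
true_influence = rng.normal(size=n_train)   # each example's effect on output

# Random training subsets, encoded as 0/1 inclusion masks.
masks = (rng.random((n_subsets, n_train)) < 0.5).astype(float)
# Stand-in for "retrain on this subset and measure the model's output".
outputs = masks @ true_influence + 0.05 * rng.normal(size=n_subsets)

# Fit the linear predictor: output ~ masks @ theta + b
X = np.column_stack([masks, np.ones(n_subsets)])
theta, *_ = np.linalg.lstsq(X, outputs, rcond=None)

preds = X @ theta
r2 = 1 - ((outputs - preds) ** 2).sum() / ((outputs - outputs.mean()) ** 2).sum()
```

The recovered coefficients `theta[:n_train]` approximate each example's influence on the output; with real models the interesting question is how far such a linear surrogate can go, which is what the benchmarks mentioned in the tweet measure.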
