joschkastrueber (@joschkastrueber)'s Twitter Profile
joschkastrueber

@joschkastrueber

ID: 1887901452112384000

Joined: 07-02-2025 16:29:46

14 Tweets

8 Followers

26 Following

Armielyn Obinguar (@aeriumcius):

“Great Models Think Alike and this Undermines AI Oversight,” shows that as language models improve, they start making more similar mistakes.

The authors introduce a metric called CAPA that measures how much models agree on their errors—beyond what you'd expect by chance.

They…
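
For intuition, a minimal sketch of chance-adjusted agreement in the spirit of Cohen's kappa; the paper's actual CAPA metric additionally uses output probabilities and conditions on errors, so the helper below is an illustration of the chance adjustment, not their definition:

import numpy as np

def prediction_kappa(preds_a, preds_b):
    """Cohen's-kappa-style chance-adjusted agreement between two models.

    Observed agreement minus the agreement expected if the two models
    answered independently from their own answer distributions. (CAPA
    refines this: it weights by output probabilities and focuses on
    agreement in errors, but the chance adjustment is the same idea.)
    """
    preds_a = np.asarray(preds_a)
    preds_b = np.asarray(preds_b)
    observed = np.mean(preds_a == preds_b)
    options = np.union1d(preds_a, preds_b)
    expected = sum(np.mean(preds_a == o) * np.mean(preds_b == o) for o in options)
    return (observed - expected) / (1.0 - expected)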
Shashwat Goel (@shashwatgoel7):

🚨Great Models Think Alike and this Undermines AI Oversight🚨
New paper quantifies LM similarity
(1) LLM-as-a-judge favors more similar models🤥
(2) Complementary knowledge benefits Weak-to-Strong Generalization☯️
(3) More capable models have more correlated failures 📈🙀
🧵👇
Eric W. Tramel (@fujikanaeda):

This is a really interesting study — looking forward to diving in more, but I was thinking about this claim of strong-model similarity. Shouldn't we expect models to converge to making the “same” predictions? This is the Platonic representation hypothesis — if

Séb Krier (@sebkrier):

Cool paper! Judge models systematically rate models higher when they share error patterns. Weak-to-strong learning works better when the supervisor and student have uncorrelated mistakes. But as model performance improves, error patterns converge across different architectures

Jonas Geiping (@jonasgeiping):

Ok, so I can finally talk about this! 

We spent the last year (actually a bit longer) training an LLM with recurrent depth at scale.

The model has an internal latent space in which it can adaptively spend more compute to think longer. 

I think the tech report ...🐦‍⬛
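
As a rough sketch of the idea (a toy simplification I'm assuming from the tweet, not the model's actual architecture), a shared block iterated in latent space lets the same weights spend variable compute per input:

import torch
import torch.nn as nn

class RecurrentDepthLM(nn.Module):
    # Toy recurrent-depth model: one shared transformer layer applied
    # n_iters times to a latent state before reading out logits.
    def __init__(self, d_model=512, n_heads=8, vocab=50257):
        super().__init__()
        self.core = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.readout = nn.Linear(d_model, vocab)

    def forward(self, h, n_iters=4):
        # More iterations = more compute spent "thinking" in latent space.
        for _ in range(n_iters):
            h = self.core(h)
        return self.readout(h)

h = torch.randn(1, 16, 512)                # (batch, sequence, d_model)
logits = RecurrentDepthLM()(h, n_iters=8)  # scale compute at inference time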
TuringPost (@theturingpost):

The freshest AI/ML research of the week:

Our top 4
▪️ AlphaGeometry2
▪️ ZebraLogic
▪️ Limo: Less is More for Reasoning
▪️ Great Models Think Alike and this Undermines AI Oversight

▪️ Activation-Informed Merging of LLMs
▪️ Content-Format Integrated Prompt Optimization (CFPO)
▪️
Andreas Hochlehnert (@ahochlehnert):

CuratedThoughts: Data Curation for RL Datasets 🚀 Since DeepSeek-R1 introduced reasoning-based RL, datasets like Open-R1 & OpenThoughts emerged for fine-tuning & GRPO. Our deep dive found major flaws — 25% of OpenThoughts had to be eliminated through data curation. Here's why 👇🧵

Prasanna Mayilvahanan @ICLR2025 (@prasannamayil):

New preprint out! 🎉🎉
How does LLM training loss translate to downstream performance?

We show that pretraining data and tokenizer shape loss-to-loss scaling laws, while architecture and other factors play a surprisingly minor role! brendel-group.github.io/llm-line/ 
🧵1/8
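
Fits like this are typically power laws, i.e. lines in log-log space; a minimal sketch with made-up loss pairs (illustrative numbers, not the paper's data):

import numpy as np

# Hypothetical (pretraining loss, downstream loss) pairs for one model family.
train_loss = np.array([3.2, 2.9, 2.6, 2.4, 2.2])
downstream_loss = np.array([2.8, 2.5, 2.1, 1.9, 1.7])

# A power law L_down = a * L_train^b becomes linear after taking logs.
b, log_a = np.polyfit(np.log(train_loss), np.log(downstream_loss), 1)
print(f"fit: L_down ≈ {np.exp(log_a):.2f} * L_train^{b:.2f}")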
Shashwat Goel (@shashwatgoel7):

I pieced together this first-principles no RL prerequisites explainer on how RL for LLMs works, and why we need it🧵

The main point? RL is exciting because it allows us to scale supervision. We can now learn from just rewards instead of demonstrations. Let's use this framing...
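
To make the rewards-vs-demonstrations contrast concrete, a hedged sketch with hypothetical tensors (policy logits, token ids, one scalar reward per sequence):

import torch.nn.functional as F

def sft_loss(logits, demo_tokens):
    # Demonstrations: imitate a reference answer token by token.
    # logits: (batch, seq, vocab), demo_tokens: (batch, seq)
    return F.cross_entropy(logits.transpose(1, 2), demo_tokens)

def reinforce_loss(logits, sampled_tokens, rewards):
    # Rewards: raise the log-probability of sampled answers in
    # proportion to the reward they earned (REINFORCE-style).
    logp = F.log_softmax(logits, dim=-1)
    tok_logp = logp.gather(-1, sampled_tokens.unsqueeze(-1)).squeeze(-1)
    return -(rewards.unsqueeze(-1) * tok_logp).mean()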
Federico D'Agostino (@fededagos):

🚨 New paper alert! 🚨
We’ve just launched openretina, an open-source framework for collaborative retina modeling across datasets and species.
A 🧵👇 (1/9)
Lukas Thede (@lukas_thede):

🧠 Keeping LLMs factually up to date is a common motivation for knowledge editing.

But what would it actually take to support this in practice at the scale and speed the real world demands?

We explore this question and really push the limits of lifelong knowledge editing.
👇
Andreas Hochlehnert (@ahochlehnert):

🧵1/ 🚨 New paper: A Sober Look at Progress in Language Model Reasoning
We re-evaluate recent SFT and RL models for mathematical reasoning and find most gains vanish under rigorous, multi-seed, standardized evaluation.

📊 bethgelab.github.io/sober-reasonin…
📄 arxiv.org/abs/2504.07086
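
The fix is cheap to replicate; a sketch assuming a hypothetical evaluate(model, seed) harness:

import numpy as np

def multi_seed_eval(evaluate, model, seeds=(0, 1, 2, 3, 4)):
    # Report mean and std over seeds; single-seed numbers on small
    # benchmarks can swing noticeably, which is the paper's warning.
    scores = np.array([evaluate(model, seed=s) for s in seeds])
    return scores.mean(), scores.std()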
Vishaal Udandarao (@vishaal_urao):

🚀New Paper!
arxiv.org/abs/2504.07086

Everyone’s celebrating rapid progress in math reasoning with RL/SFT. But how real is this progress?

We re-evaluated recently released popular reasoning models—and found reported gains often vanish under rigorous testing!! 👀

🧵👇
Shashwat Goel (@shashwatgoel7):

Confused about recent LLM RL results where models improve without any ground-truth signal? We were too. Until we looked at the reported numbers of the Pre-RL models and realized they were severely underreported across papers. We compiled discrepancies in a blog below🧵👇
Ori Press (@ori_press):

Do language models have algorithmic creativity?

To find out, we built AlgoTune, a benchmark challenging agents to optimize 100+ algorithms like gzip compression, AES encryption and PCA. Frontier models struggle, finding only surface-level wins. Lots of headroom here!🧵⬇️
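
As I read it, the scoring idea is speedup over a reference implementation, counted only when the output is still correct; a hypothetical sketch (not the actual AlgoTune harness):

import time

def speedup_score(baseline, candidate, payload, check):
    # Time both solvers on the same input; incorrect output scores zero.
    t0 = time.perf_counter()
    ref = baseline(payload)
    t_base = time.perf_counter() - t0
    t0 = time.perf_counter()
    out = candidate(payload)
    t_cand = time.perf_counter() - t0
    return t_base / t_cand if check(out, ref) else 0.0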
Shashwat Goel (@shashwatgoel7):

There's been a hole at the heart of #LLM evals, and we can now fix it.

📜New paper: Answer Matching Outperforms Multiple Choice for Language Model Evaluations.

❗️We found MCQs can be solved without even knowing the question. Looking at just the choices helps guess the answer
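
A cheap way to probe that shortcut: show the model only the options and measure accuracy against chance; a sketch assuming a hypothetical ask_model callable:

def choices_only_accuracy(dataset, ask_model, letters="ABCD"):
    # Accuracy well above 1/len(letters) means the choices alone
    # leak the answer, without the question ever being shown.
    correct = 0
    for item in dataset:
        prompt = "Pick the most likely correct option:\n" + "\n".join(
            f"{l}. {text}" for l, text in zip(letters, item["choices"])
        )
        if ask_model(prompt).strip().upper().startswith(item["answer"]):
            correct += 1
    return correct / len(dataset)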
Shashwat Goel (@shashwatgoel7):

Presenting today at #ICML2025. To learn how to measure language model similarity, and its effects on LLM-as-a-Judge and weak-to-strong distillation, join our poster session: today, 11 am to 1:30 pm, East Exhibition Hall A-B, E-2411, w/ Ameya P. @ ICML 2025, joschkastrueber, Ilze Amanda Auzina