joschkastrueber (@joschkastrueber)'s Twitter Profile
joschkastrueber

@joschkastrueber

ID: 1887901452112384000

Joined: 07-02-2025 16:29:46

14 Tweets

8 Followers

26 Following

Armielyn Obinguar (@aeriumcius):

“Great Models Think Alike and this Undermines AI Oversight,” shows that as language models improve, they start making more similar mistakes.

The authors introduce a metric called CAPA that measures how much models agree on their errors—beyond what you'd expect by chance.

They…
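
For intuition, a minimal sketch of chance-adjusted agreement in the spirit of Cohen's kappa; the paper's actual CAPA metric additionally uses output probabilities and conditions on errors, so the helper below is an illustration of the chance adjustment, not their definition:

import numpy as np

def prediction_kappa(preds_a, preds_b):
    """Cohen's-kappa-style chance-adjusted agreement between two models.

    Observed agreement minus the agreement expected if the two models
    answered independently from their own answer distributions. (CAPA
    refines this: it weights by output probabilities and focuses on
    agreement in errors, but the chance adjustment is the same idea.)
    """
    preds_a = np.asarray(preds_a)
    preds_b = np.asarray(preds_b)
    observed = np.mean(preds_a == preds_b)
    options = np.union1d(preds_a, preds_b)
    expected = sum(np.mean(preds_a == o) * np.mean(preds_b == o) for o in options)
    return (observed - expected) / (1.0 - expected)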
Shashwat Goel (@shashwatgoel7):

🚨Great Models Think Alike and this Undermines AI Oversight🚨
New paper quantifies LM similarity
(1) LLM-as-a-judge favors more similar models🤥
(2) Complementary knowledge benefits Weak-to-Strong Generalization☯️
(3) More capable models have more correlated failures 📈🙀
🧵👇
Eric W. Tramel (@fujikanaeda):

This is a really interesting study — looking forward to diving in more, but I was thinking about this claim of strong-model similarity. Shouldn't we expect models to converge to making the “same” predictions? This is the Platonic representation hypothesis — if

Séb Krier (@sebkrier):

Cool paper! Judge models systematically rate models higher when they share error patterns. Weak-to-strong learning works better when the supervisor and student have uncorrelated mistakes. But as model performance improves, error patterns converge across different architectures

Jonas Geiping (@jonasgeiping):

Ok, so I can finally talk about this! 

We spent the last year (actually a bit longer) training an LLM with recurrent depth at scale.

The model has an internal latent space in which it can adaptively spend more compute to think longer. 

I think the tech report ...🐦‍⬛
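
As a rough sketch of the idea (a toy simplification I'm assuming from the tweet, not the model's actual architecture), a shared block iterated in latent space lets the same weights spend variable compute per input:

import torch
import torch.nn as nn

class RecurrentDepthLM(nn.Module):
    # Toy recurrent-depth model: one shared transformer layer applied
    # n_iters times to a latent state before reading out logits.
    def __init__(self, d_model=512, n_heads=8, vocab=50257):
        super().__init__()
        self.core = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.readout = nn.Linear(d_model, vocab)

    def forward(self, h, n_iters=4):
        # More iterations = more compute spent "thinking" in latent space.
        for _ in range(n_iters):
            h = self.core(h)
        return self.readout(h)

h = torch.randn(1, 16, 512)                # (batch, sequence, d_model)
logits = RecurrentDepthLM()(h, n_iters=8)  # scale compute at inference time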
TuringPost (@theturingpost):

The freshest AI/ML research of the week:

Our top 4
▪️ AlphaGeometry2
▪️ ZebraLogic
▪️ Limo: Less is More for Reasoning
▪️ Great Models Think Alike and this Undermines AI Oversight

▪️ Activation-Informed Merging of LLMs
▪️ Content-Format Integrated Prompt Optimization (CFPO)
▪️
Andreas Hochlehnert (@ahochlehnert):

CuratedThoughts: Data Curation for RL Datasets 🚀 Since DeepSeek-R1 introduced reasoning-based RL, datasets like Open-R1 & OpenThoughts emerged for fine-tuning & GRPO. Our deep dive found major flaws — 25% of OpenThoughts had to be eliminated through data curation. Here's why 👇🧵

Prasanna Mayilvahanan @ICLR2025 (@prasannamayil):

New preprint out! 🎉🎉
How does LLM training loss translate to downstream performance?

We show that pretraining data and tokenizer shape loss-to-loss scaling laws, while architecture and other factors play a surprisingly minor role! brendel-group.github.io/llm-line/ 
🧵1/8
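
Fits like this are typically power laws, i.e. lines in log-log space; a minimal sketch with made-up loss pairs (illustrative numbers, not the paper's data):

import numpy as np

# Hypothetical (pretraining loss, downstream loss) pairs for one model family.
train_loss = np.array([3.2, 2.9, 2.6, 2.4, 2.2])
downstream_loss = np.array([2.8, 2.5, 2.1, 1.9, 1.7])

# A power law L_down = a * L_train^b becomes linear after taking logs.
b, log_a = np.polyfit(np.log(train_loss), np.log(downstream_loss), 1)
print(f"fit: L_down ≈ {np.exp(log_a):.2f} * L_train^{b:.2f}")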
Shashwat Goel (@shashwatgoel7):

I pieced together this first-principles no RL prerequisites explainer on how RL for LLMs works, and why we need it🧵

The main point? RL is exciting because it allows us to scale supervision. We can now learn from just rewards instead of demonstrations. Let's use this framing...
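
To make the rewards-vs-demonstrations contrast concrete, a hedged sketch with hypothetical tensors (policy logits, token ids, one scalar reward per sequence):

import torch.nn.functional as F

def sft_loss(logits, demo_tokens):
    # Demonstrations: imitate a reference answer token by token.
    # logits: (batch, seq, vocab), demo_tokens: (batch, seq)
    return F.cross_entropy(logits.transpose(1, 2), demo_tokens)

def reinforce_loss(logits, sampled_tokens, rewards):
    # Rewards: raise the log-probability of sampled answers in
    # proportion to the reward they earned (REINFORCE-style).
    logp = F.log_softmax(logits, dim=-1)
    tok_logp = logp.gather(-1, sampled_tokens.unsqueeze(-1)).squeeze(-1)
    return -(rewards.unsqueeze(-1) * tok_logp).mean()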
Federico D'Agostino (@fededagos):

🚨 New paper alert! 🚨
We’ve just launched openretina, an open-source framework for collaborative retina modeling across datasets and species.
A 🧵👇 (1/9)
Lukas Thede (@lukas_thede):

🧠 Keeping LLMs factually up to date is a common motivation for knowledge editing.

But what would it actually take to support this in practice at the scale and speed the real world demands?

We explore this question and really push the limits of lifelong knowledge editing.
👇
Andreas Hochlehnert (@ahochlehnert):

🧵1/ 🚨 New paper: A Sober Look at Progress in Language Model Reasoning
We re-evaluate recent SFT and RL models for mathematical reasoning and find most gains vanish under rigorous, multi-seed, standardized evaluation.

📊 bethgelab.github.io/sober-reasonin…
📄 arxiv.org/abs/2504.07086
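
The fix is cheap to replicate; a sketch assuming a hypothetical evaluate(model, seed) harness:

import numpy as np

def multi_seed_eval(evaluate, model, seeds=(0, 1, 2, 3, 4)):
    # Report mean and std over seeds; single-seed numbers on small
    # benchmarks can swing noticeably, which is the paper's warning.
    scores = np.array([evaluate(model, seed=s) for s in seeds])
    return scores.mean(), scores.std()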
Vishaal Udandarao (@vishaal_urao):

🚀New Paper!
arxiv.org/abs/2504.07086

Everyone’s celebrating rapid progress in math reasoning with RL/SFT. But how real is this progress?

We re-evaluated recently released popular reasoning models—and found reported gains often vanish under rigorous testing!! 👀

🧵👇
Shashwat Goel (@shashwatgoel7):

Confused about recent LLM RL results where models improve without any ground-truth signal? We were too. Until we looked at the reported numbers of the Pre-RL models and realized they were severely underreported across papers. We compiled discrepancies in a blog below🧵👇
Ori Press (@ori_press):

Do language models have algorithmic creativity?

To find out, we built AlgoTune, a benchmark challenging agents to optimize 100+ algorithms like gzip compression, AES encryption and PCA. Frontier models struggle, finding only surface-level wins. Lots of headroom here!🧵⬇️
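
As I read it, the scoring idea is speedup over a reference implementation, counted only when the output is still correct; a hypothetical sketch (not the actual AlgoTune harness):

import time

def speedup_score(baseline, candidate, payload, check):
    # Time both solvers on the same input; incorrect output scores zero.
    t0 = time.perf_counter()
    ref = baseline(payload)
    t_base = time.perf_counter() - t0
    t0 = time.perf_counter()
    out = candidate(payload)
    t_cand = time.perf_counter() - t0
    return t_base / t_cand if check(out, ref) else 0.0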
Shashwat Goel (@shashwatgoel7):

There's been a hole at the heart of #LLM evals, and we can now fix it.

📜New paper: Answer Matching Outperforms Multiple Choice for Language Model Evaluations.

❗️We found MCQs can be solved without even knowing the question. Looking at just the choices helps guess the answer
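
A cheap way to probe that shortcut: show the model only the options and measure accuracy against chance; a sketch assuming a hypothetical ask_model callable:

def choices_only_accuracy(dataset, ask_model, letters="ABCD"):
    # Accuracy well above 1/len(letters) means the choices alone
    # leak the answer, without the question ever being shown.
    correct = 0
    for item in dataset:
        prompt = "Pick the most likely correct option:\n" + "\n".join(
            f"{l}. {text}" for l, text in zip(letters, item["choices"])
        )
        if ask_model(prompt).strip().upper().startswith(item["answer"]):
            correct += 1
    return correct / len(dataset)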
Shashwat Goel (@shashwatgoel7):

Presenting today at #ICML2025. To learn how to measure language model similarity, and its effects on LLM-as-a-Judge and weak-to-strong distillation, join our poster session: today, 11 am to 1:30 pm, East Exhibition Hall A-B, E-2411, w/ Ameya P. @ ICML 2025, joschkastrueber, Ilze Amanda Auzina