Bethge Lab (@bethgelab) 's Twitter Profile
Bethge Lab

@bethgelab

Perceiving Neural Networks

ID: 882978663825846273

http://bethgelab.org · Joined: 06-07-2017 15:04:53

308 Tweets

3.3K Followers

257 Following

Marcel Binz (@marcel_binz) 's Twitter Profile Photo

Excited to announce Centaur -- the first foundation model of human cognition. Centaur can predict and simulate human behavior in any experiment expressible in natural language. You can readily download the model from Hugging Face and test it yourself: huggingface.co/marcelbinz/Lla…
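A minimal quick-start sketch, assuming the model can be loaded with the Hugging Face transformers library: the repo id is truncated in the tweet, so the placeholder below must be replaced with the full id from the author's Hugging Face page, and the prompt is an invented example of an experiment described in natural language.

```python
# Hedged sketch: load Centaur from Hugging Face and generate a behavioral prediction.
# The repo id is a placeholder; substitute the full id from the (truncated) tweet link.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "marcelbinz/<full-model-id>"  # placeholder, not the exact repo name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

# An experiment described in natural language (invented example).
prompt = ("You are choosing between two slot machines. Machine A paid out on 7 of 10 pulls, "
          "Machine B on 4 of 10. Which machine do you choose?")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```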

Bethge Lab (@bethgelab) 's Twitter Profile Photo

Check out the latest work from our lab on how to merge your multimodal models over time! We find several exciting insights with implications for model merging, continual pretraining and distributed/federated training. Full thread below:
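For context, the simplest model-merging baseline is weight-space averaging of two checkpoints with the same architecture; the sketch below illustrates only that generic baseline, not the temporal-merging method studied in the thread.

```python
# Generic illustration of weight-space model merging by parameter averaging.
# Not the specific method from the thread; just the common baseline it builds on.
import torch

def average_merge(state_dict_a, state_dict_b, alpha=0.5):
    """Interpolate two state dicts from models with identical architectures."""
    merged = {}
    for name, param_a in state_dict_a.items():
        param_b = state_dict_b[name]
        if torch.is_floating_point(param_a):
            merged[name] = alpha * param_a + (1.0 - alpha) * param_b
        else:
            merged[name] = param_a  # keep non-float buffers (e.g. step counters) from one model
    return merged

# Usage sketch: model.load_state_dict(average_merge(old_ckpt, new_ckpt, alpha=0.5))
```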

Shashwat Goel (@shashwatgoel7) 's Twitter Profile Photo

🚨Great Models Think Alike and this Undermines AI Oversight🚨 New paper quantifies LM similarity: (1) LLM-as-a-judge favors more similar models🤥 (2) Complementary knowledge benefits Weak-to-Strong Generalization☯️ (3) More capable models have more correlated failures 📈🙀 🧵👇
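One simple way to make "quantifies LM similarity" concrete is to measure how strongly two models' per-example errors agree beyond chance on a shared benchmark; the sketch below uses Cohen's kappa on correctness vectors, which is an illustrative choice and not necessarily the paper's metric.

```python
# Hedged sketch: quantify similarity of two models via error agreement beyond chance.
import numpy as np

def error_agreement_kappa(correct_a, correct_b):
    """Cohen's kappa between two per-example correctness vectors."""
    a = np.asarray(correct_a, dtype=bool)
    b = np.asarray(correct_b, dtype=bool)
    observed = np.mean(a == b)                      # how often the models agree
    p_a, p_b = a.mean(), b.mean()
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)    # agreement expected by chance
    return (observed - expected) / (1 - expected)

# Two models scored on the same six benchmark items (toy data).
model_a = [True, False, True, True, False, True]
model_b = [True, False, False, True, False, True]
print(error_agreement_kappa(model_a, model_b))  # higher value = more correlated successes/failures
```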

Andreas Hochlehnert (@ahochlehnert) 's Twitter Profile Photo

CuratedThoughts: Data Curation for RL Datasets 🚀 Since DeepSeek-R1 introduced reasoning-based RL, datasets like Open-R1 & OpenThoughts have emerged for fine-tuning & GRPO. Our deep dive found major flaws: 25% of OpenThoughts had to be removed through data curation. Here's why 👇🧵

Shiven Sinha (@shiven_sinha) 's Twitter Profile Photo

AI can generate correct-seeming hypotheses (and papers!). Brandolini's law states BS is harder to refute than generate. Can LMs falsify incorrect solutions? o3-mini (high) scores just 9% on our new benchmark REFUTE. Verification is not necessarily easier than generation 🧵
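To make "falsify an incorrect solution" concrete: a counterexample is valid exactly when the candidate solution and a trusted reference disagree on it. The sketch below uses an invented buggy maximum-subarray solution, not an actual REFUTE task or its harness.

```python
# Hedged sketch: checking a proposed counterexample against a buggy candidate solution.

def reference_max_subarray(xs):
    """Trusted reference (Kadane's algorithm)."""
    best = cur = xs[0]
    for x in xs[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best

def buggy_max_subarray(xs):
    """Incorrect candidate: silently assumes the answer is never negative."""
    best = cur = 0
    for x in xs:
        cur = max(0, cur + x)
        best = max(best, cur)
    return best

def is_valid_counterexample(xs):
    """The counterexample refutes the candidate iff the two implementations disagree."""
    return buggy_max_subarray(xs) != reference_max_subarray(xs)

print(is_valid_counterexample([-3, -1, -2]))  # True: an all-negative input exposes the bug
```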

Bethge Lab (@bethgelab) 's Twitter Profile Photo

Check out this cool new work from Bethgelab & friends! Falsifying flawed solutions is key to science, but LMs aren't there yet. Even advanced models produce counterexamples for <9% of mistakes, despite solving ~48% of problems. Full thread below:

Çağatay Yıldız (@cgtyyldz) 's Twitter Profile Photo

For our "Automated Assessment of Teaching Quality" project, we are looking for two PhD students: one in educational/cognitive sciences or a related field (uni-tuebingen.de/fakultaeten/wi…) and one in machine learning (uni-tuebingen.de/en/faculties/f…). Please apply and reach me out for details!

Lukas Thede (@lukas_thede) 's Twitter Profile Photo

🧠 Keeping LLMs factually up to date is a common motivation for knowledge editing. But what would it actually take to support this in practice at the scale and speed the real world demands? We explore this question and really push the limits of lifelong knowledge editing. 👇

Bethge Lab (@bethgelab) 's Twitter Profile Photo

Recent work from our lab asking how to fairly evaluate and measure progress in language model reasoning! Check out the full thread below!

Ori Press (@ori_press) 's Twitter Profile Photo

Do language models have algorithmic creativity? To find out, we built AlgoTune, a benchmark challenging agents to optimize 100+ algorithms like gzip compression, AES encryption and PCA. Frontier models struggle, finding only surface-level wins. Lots of headroom here!🧵⬇️
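To illustrate what "optimizing an algorithm" means in such a benchmark: time a candidate against a baseline on the same input and verify the output is still correct. The harness below is an illustrative stand-in built around zlib, not AlgoTune's actual evaluation code.

```python
# Hedged sketch: measure speedup of a candidate over a baseline while checking correctness.
import time
import zlib

def baseline_compress(data: bytes) -> bytes:
    return zlib.compress(data, level=9)   # reference: slow, high compression ratio

def candidate_compress(data: bytes) -> bytes:
    return zlib.compress(data, level=1)   # candidate: trades some ratio for speed

def is_correct(compressed: bytes, original: bytes) -> bool:
    return zlib.decompress(compressed) == original  # output must round-trip losslessly

def timed(fn, data, repeats=5):
    start = time.perf_counter()
    for _ in range(repeats):
        out = fn(data)
    return out, (time.perf_counter() - start) / repeats

data = b"algotune " * 100_000
_, baseline_t = timed(baseline_compress, data)
candidate_out, candidate_t = timed(candidate_compress, data)
print(f"correct: {is_correct(candidate_out, data)}, speedup: {baseline_t / candidate_t:.2f}x")
```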

Bethge Lab (@bethgelab) 's Twitter Profile Photo

🧠🤖 We’re hiring a Postdoc in NeuroAI! Join CRC1233 "Robust Vision" (Uni Tübingen) to build benchmarks & evaluation methods for vision models, bridging brain & AI. Work with top faculty & shape vision research. Apply: tinyurl.com/3jtb4an6 #NeuroAI #Jobs

Adhiraj Ghosh (@adhiraj_ghosh98) 's Twitter Profile Photo

Excited to be in Vienna for #ACL2025🇦🇹! You'll find Sebastian Dziadzio and me by our ONEBench poster, so do drop by! 🗓️Wed, July 30, 11-12:30 CET 📍Hall 4/5 I'm also excited to talk about lifelong and personalised benchmarking, data curation and vision-language in general! Let's connect!
