Bethge Lab (@bethgelab) 's Twitter Profile
Bethge Lab

@bethgelab

Perceiving Neural Networks

ID: 882978663825846273

http://bethgelab.org · Joined: 06-07-2017 15:04:53

308 Tweets

3.3K Followers

257 Following

Marcel Binz (@marcel_binz) 's Twitter Profile Photo

Excited to announce Centaur -- the first foundation model of human cognition. Centaur can predict and simulate human behavior in any experiment expressible in natural language. You can readily download the model from Hugging Face and test it yourself: huggingface.co/marcelbinz/Lla…
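A minimal quick-start sketch, assuming the model can be loaded with the Hugging Face transformers library: the repo id is truncated in the tweet, so the placeholder below must be replaced with the full id from the author's Hugging Face page, and the prompt is an invented example of an experiment described in natural language.

```python
# Hedged sketch: load Centaur from Hugging Face and generate a behavioral prediction.
# The repo id is a placeholder; substitute the full id from the (truncated) tweet link.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "marcelbinz/<full-model-id>"  # placeholder, not the exact repo name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

# An experiment described in natural language (invented example).
prompt = ("You are choosing between two slot machines. Machine A paid out on 7 of 10 pulls, "
          "Machine B on 4 of 10. Which machine do you choose?")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```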

Bethge Lab (@bethgelab) 's Twitter Profile Photo

Check out the latest work from our lab on how to merge your multimodal models over time! We find several exciting insights with implications for model merging, continual pretraining and distributed/federated training. Full thread below:
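For context, the simplest model-merging baseline is weight-space averaging of two checkpoints with the same architecture; the sketch below illustrates only that generic baseline, not the temporal-merging method studied in the thread.

```python
# Generic illustration of weight-space model merging by parameter averaging.
# Not the specific method from the thread; just the common baseline it builds on.
import torch

def average_merge(state_dict_a, state_dict_b, alpha=0.5):
    """Interpolate two state dicts from models with identical architectures."""
    merged = {}
    for name, param_a in state_dict_a.items():
        param_b = state_dict_b[name]
        if torch.is_floating_point(param_a):
            merged[name] = alpha * param_a + (1.0 - alpha) * param_b
        else:
            merged[name] = param_a  # keep non-float buffers (e.g. step counters) from one model
    return merged

# Usage sketch: model.load_state_dict(average_merge(old_ckpt, new_ckpt, alpha=0.5))
```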

Shashwat Goel (@shashwatgoel7) 's Twitter Profile Photo

🚨Great Models Think Alike and this Undermines AI Oversight🚨 New paper quantifies LM similarity: (1) LLM-as-a-judge favors more similar models🤥 (2) Complementary knowledge benefits Weak-to-Strong Generalization☯️ (3) More capable models have more correlated failures 📈🙀 🧵👇
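One simple way to make "quantifies LM similarity" concrete is to measure how strongly two models' per-example errors agree beyond chance on a shared benchmark; the sketch below uses Cohen's kappa on correctness vectors, which is an illustrative choice and not necessarily the paper's metric.

```python
# Hedged sketch: quantify similarity of two models via error agreement beyond chance.
import numpy as np

def error_agreement_kappa(correct_a, correct_b):
    """Cohen's kappa between two per-example correctness vectors."""
    a = np.asarray(correct_a, dtype=bool)
    b = np.asarray(correct_b, dtype=bool)
    observed = np.mean(a == b)                      # how often the models agree
    p_a, p_b = a.mean(), b.mean()
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)    # agreement expected by chance
    return (observed - expected) / (1 - expected)

# Two models scored on the same six benchmark items (toy data).
model_a = [True, False, True, True, False, True]
model_b = [True, False, False, True, False, True]
print(error_agreement_kappa(model_a, model_b))  # higher value = more correlated successes/failures
```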

Andreas Hochlehnert (@ahochlehnert) 's Twitter Profile Photo

CuratedThoughts: Data Curation for RL Datasets 🚀 Since DeepSeek-R1 introduced reasoning-based RL, datasets like Open-R1 & OpenThoughts have emerged for fine-tuning & GRPO. Our deep dive found major flaws: 25% of OpenThoughts had to be removed through data curation. Here's why 👇🧵

Shiven Sinha (@shiven_sinha) 's Twitter Profile Photo

AI can generate correct-seeming hypotheses (and papers!). Brandolini's law states BS is harder to refute than generate. Can LMs falsify incorrect solutions? o3-mini (high) scores just 9% on our new benchmark REFUTE. Verification is not necessarily easier than generation 🧵
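To make "falsify an incorrect solution" concrete: a counterexample is valid exactly when the candidate solution and a trusted reference disagree on it. The sketch below uses an invented buggy maximum-subarray solution, not an actual REFUTE task or its harness.

```python
# Hedged sketch: checking a proposed counterexample against a buggy candidate solution.

def reference_max_subarray(xs):
    """Trusted reference (Kadane's algorithm)."""
    best = cur = xs[0]
    for x in xs[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best

def buggy_max_subarray(xs):
    """Incorrect candidate: silently assumes the answer is never negative."""
    best = cur = 0
    for x in xs:
        cur = max(0, cur + x)
        best = max(best, cur)
    return best

def is_valid_counterexample(xs):
    """The counterexample refutes the candidate iff the two implementations disagree."""
    return buggy_max_subarray(xs) != reference_max_subarray(xs)

print(is_valid_counterexample([-3, -1, -2]))  # True: an all-negative input exposes the bug
```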

Bethge Lab (@bethgelab) 's Twitter Profile Photo

Check out this cool new work from Bethgelab & friends! Falsifying flawed solutions is key to science, but LMs aren't there yet. Even advanced models produce counterexamples for <9% of mistakes, despite solving ~48% of problems. Full thread below:

Çağatay Yıldız (@cgtyyldz) 's Twitter Profile Photo

For our "Automated Assessment of Teaching Quality" project, we are looking for two PhD students: one in educational/cognitive sciences or a related field (uni-tuebingen.de/fakultaeten/wi…) and one in machine learning (uni-tuebingen.de/en/faculties/f…). Please apply and reach me out for details!

Lukas Thede (@lukas_thede) 's Twitter Profile Photo

🧠 Keeping LLMs factually up to date is a common motivation for knowledge editing. But what would it actually take to support this in practice at the scale and speed the real world demands? We explore this question and really push the limits of lifelong knowledge editing. 👇

Bethge Lab (@bethgelab) 's Twitter Profile Photo

Recent work from our lab asking how to fairly evaluate and measure progress in language model reasoning! Check out the full thread below!

Ori Press (@ori_press) 's Twitter Profile Photo

Do language models have algorithmic creativity? To find out, we built AlgoTune, a benchmark challenging agents to optimize 100+ algorithms like gzip compression, AES encryption and PCA. Frontier models struggle, finding only surface-level wins. Lots of headroom here!🧵⬇️
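To illustrate what "optimizing an algorithm" means in such a benchmark: time a candidate against a baseline on the same input and verify the output is still correct. The harness below is an illustrative stand-in built around zlib, not AlgoTune's actual evaluation code.

```python
# Hedged sketch: measure speedup of a candidate over a baseline while checking correctness.
import time
import zlib

def baseline_compress(data: bytes) -> bytes:
    return zlib.compress(data, level=9)   # reference: slow, high compression ratio

def candidate_compress(data: bytes) -> bytes:
    return zlib.compress(data, level=1)   # candidate: trades some ratio for speed

def is_correct(compressed: bytes, original: bytes) -> bool:
    return zlib.decompress(compressed) == original  # output must round-trip losslessly

def timed(fn, data, repeats=5):
    start = time.perf_counter()
    for _ in range(repeats):
        out = fn(data)
    return out, (time.perf_counter() - start) / repeats

data = b"algotune " * 100_000
_, baseline_t = timed(baseline_compress, data)
candidate_out, candidate_t = timed(candidate_compress, data)
print(f"correct: {is_correct(candidate_out, data)}, speedup: {baseline_t / candidate_t:.2f}x")
```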

Bethge Lab (@bethgelab) 's Twitter Profile Photo

🧠🤖 We’re hiring a Postdoc in NeuroAI! Join CRC1233 "Robust Vision" (Uni Tübingen) to build benchmarks & evaluation methods for vision models, bridging brain & AI. Work with top faculty & shape vision research. Apply: tinyurl.com/3jtb4an6 #NeuroAI #Jobs

Adhiraj Ghosh (@adhiraj_ghosh98) 's Twitter Profile Photo

Excited to be in Vienna for #ACL2025🇦🇹! You'll find Sebastian Dziadzio and me by our ONEBench poster, so do drop by! 🗓️Wed, July 30, 11-12:30 CET 📍Hall 4/5 I'm also excited to talk about lifelong and personalised benchmarking, data curation and vision-language in general! Let's connect!
