Can Rager (@can_rager) 's Twitter Profile
Can Rager

@can_rager

AI Explainability | Physics

ID: 1706592367099199488

calendar_today26-09-2023 08:53:05

45 Tweet

214 Followers

65 Following

Samuel Marks (@saprmarks) 's Twitter Profile Photo

New paper with Johannes Treutlein , Evan Hubinger , and many other coauthors! We train a model with a hidden misaligned objective and use it to run an auditing game: Can other teams of researchers uncover the model’s objective? x.com/AnthropicAI/st…

Adam Karvonen (@a_karvonen) 's Twitter Profile Photo

We're excited to announce the release of SAE Bench 1.0, our suite of Sparse Autoencoder (SAE) evaluations! We have also trained / evaluated a suite of open-source SAEs across 7 architectures. This has led to exciting new qualitative findings! Our findings in the 🧵 below 👇

We're excited to announce the release of SAE Bench 1.0, our suite of Sparse Autoencoder (SAE) evaluations! We have also trained / evaluated a suite of open-source SAEs across 7 architectures. This has led to exciting new qualitative findings!
Our findings in the 🧵 below 👇
David Bau (@davidbau) 's Twitter Profile Photo

Why is interpretability the key to dominance in AI? Not winning the scaling race, or banning China. Our answer to OSTP/NSF, w/ Goodfire's Tom McGrath Transluce's Sarah Schwettmann MIT's Dylan HadfieldMenell resilience.baulab.info/docs/AI_Action… Here's why:🧵 ↘️

Why is interpretability the key to dominance in AI?

Not winning the scaling race, or banning China.

Our answer to OSTP/NSF, w/ Goodfire's <a href="/banburismus_/">Tom McGrath</a> Transluce's <a href="/cogconfluence/">Sarah Schwettmann</a> MIT's <a href="/dhadfieldmenell/">Dylan HadfieldMenell</a>
resilience.baulab.info/docs/AI_Action…

Here's why:🧵 ↘️
Alex Loftus @ ICLR2025 (@alexloftus19) 's Twitter Profile Photo

This is a good place to mention - we hosted NEMI last year in New England, a regional interpretability meetup, with 100+ attendees. People outside the region welcome to join. We will likely do it again this year!