
Can Rager
@can_rager
AI Explainability | Physics
ID: 1706592367099199488
26-09-2023 08:53:05
45 Tweets
214 Followers
65 Following

New paper with Johannes Treutlein, Evan Hubinger, and many other coauthors! We train a model with a hidden misaligned objective and use it to run an auditing game: Can other teams of researchers uncover the model’s objective? x.com/AnthropicAI/st…


Why is interpretability the key to dominance in AI? Not winning the scaling race, or banning China. Our answer to OSTP/NSF, w/ Goodfire's Tom McGrath, Transluce's Sarah Schwettmann, and MIT's Dylan Hadfield-Mennell: resilience.baulab.info/docs/AI_Action… Here's why: 🧵 ↘️


