Sarah Schwettmann (@cogconfluence)'s Twitter Profile
Sarah Schwettmann

@cogconfluence

Co-founder and CSO, @TransluceAI // Research Scientist, @MIT_CSAIL

ID: 4020498861

Website: http://cogconfluence.com
Joined: 23-10-2015 00:33:08

1.1K Tweets

2.2K Followers

909 Following

David Bau (@davidbau):

Why is interpretability the key to dominance in AI?

Not winning the scaling race, or banning China.

Our answer to OSTP/NSF, w/ Goodfire's Tom McGrath, Transluce's Sarah Schwettmann, and MIT's Dylan Hadfield-Menell
resilience.baulab.info/docs/AI_Action…

Here's why:🧵 ↘️
Transluce (@transluceai):

To interpret AI benchmarks, we need to look at the data. Top-level numbers don't mean what you think: there may be broken tasks, unexpected behaviors, or near-misses. We're introducing Docent to accelerate analysis of AI agent transcripts. It can spot surprises in seconds. 🧵👇

David Bau (@davidbau):

Interpreting LLMs requires us to understand long rollouts: Surprises are not just hidden in the neurons, but can also be buried in enormous generated texts. Kevin Meng, Sarah Schwettmann, and Transluce have tackled this with a new kind of tool aimed at understanding huge LM traces. ↘️

Wojciech Zaremba (@woj_zaremba):

We're entering an era where AI outputs are becoming so vast, humans alone can't analyze them. Today's LLMs produce tens of thousands of tokens per task—but complex challenges like comprehensive cancer research, inventing novel molecules, or building entire codebases will soon

Kevin Meng (@mengk20):

AI models are *not* solving problems the way we think

using Docent, we find that Claude solves *broken* eval tasks - memorizing answers & hallucinating them!

details in 🧵

we really need to look at our data harder, and it's time to rethink how we do evals...
Arthur Conmy (@arthurconmy):

Steering vectors were proposed as the top-down interpretability tool of choice, but I've thought for a while that even higher-level prompt/response debugging tools are actually the most promising top-down tool - nice!

Kevin Meng (@mengk20):

i'm really excited about our Docent roadmap :) we're developing:

- open protocols, schemas, and interfaces for interpreting AI agent traces
- automated systems that can propose and verify general hypotheses about model behaviors, using eval results

come work with us! roles 👇

Todor Markov (@todor_m_markov):

Today, 11 other former OpenAI employees and I filed an amicus brief in the Musk v Altman case. We worked at OpenAI; we know the promises it was founded on and we're worried that in the conversion those promises will be broken. The nonprofit needs to retain control of the

Transluce (@transluceai):

We tested a pre-release version of o3 and found that it frequently fabricates actions it never took, and then elaborately justifies these actions when confronted.

We were surprised, so we dug deeper 🔎🧵(1/)

x.com/OpenAI/status/…
Transluce (@transluceai):

Update: this behavior seems to replicate in o3 deployed in ChatGPT.

Unlike the o3 model we evaluated using the API, o3 in ChatGPT does have access to a Python tool. But ChatGPT still seems to think it’s running code on its own MacBook Pro! 👇(1/)
Daniel Johnson (@_ddjohnson):

Pretty striking follow-up finding from our o3 investigations: in the chain of thought summary, o3 plans to tell the truth — but then it makes something up anyway!
Hadas Orgad (@orgadhadas):

Position papers wanted!

For the First Workshop on Actionable Interpretability, we're looking for diverse perspectives on the state of the field. Should certain areas of interpretability research be developed further? Are there key metrics we should prioritize? Or do you have >>
Harvard University (@harvard):

"Moments ago, we filed a lawsuit to halt the funding freeze because it is unlawful and beyond the government’s authority." - President Alan Garber hrvd.me/Complain421t

Transluce (@transluceai):

We're flying to Singapore for #ICLR2025! ✈️ 

Want to chat with Neil Chowdhury, Jacob Steinhardt and Sarah Schwettmann about Transluce? We're also hiring for several roles in research & product.

Share your contact info on this form and we'll be in touch 👇
forms.gle/4EHLvYnMfdyrV5…