Sarah Schwettmann (@cogconfluence)'s Twitter Profile
Sarah Schwettmann

@cogconfluence

Co-founder and CSO, @TransluceAI // Research Scientist, @MIT_CSAIL

ID: 4020498861

Website: http://cogconfluence.com
Joined: 23-10-2015 00:33:08

1.1K Tweets

2.2K Followers

909 Following

David Bau (@davidbau):

Why is interpretability the key to dominance in AI?

Not winning the scaling race, or banning China.

Our answer to OSTP/NSF, w/ Goodfire's Tom McGrath, Transluce's Sarah Schwettmann, and MIT's Dylan Hadfield-Menell
resilience.baulab.info/docs/AI_Action…

Here's why:🧵 ↘️
Transluce (@transluceai):

To interpret AI benchmarks, we need to look at the data. Top-level numbers don't mean what you think: there may be broken tasks, unexpected behaviors, or near-misses. We're introducing Docent to accelerate analysis of AI agent transcripts. It can spot surprises in seconds. 🧵👇

David Bau (@davidbau):

Interpreting LLMs requires us to understand long rollouts: Surprises are not just hidden in the neurons, but can also be buried in enormous generated texts. Kevin Meng, Sarah Schwettmann, and Transluce have tackled this with a new kind of tool aimed at understanding huge LM traces. ↘️

Wojciech Zaremba (@woj_zaremba):

We're entering an era where AI outputs are becoming so vast, humans alone can't analyze them. Today's LLMs produce tens of thousands of tokens per task—but complex challenges like comprehensive cancer research, inventing novel molecules, or building entire codebases will soon

Kevin Meng (@mengk20):

AI models are *not* solving problems the way we think

using Docent, we find that Claude solves *broken* eval tasks - memorizing answers & hallucinating them!

details in 🧵

we really need to look at our data harder, and it's time to rethink how we do evals...
Arthur Conmy (@arthurconmy):

Steering vectors were proposed as the top-down interpretability tool of choice, but I've thought for a while that even higher-level prompt/response debugging tools are actually the most promising top-down tool - nice!

Kevin Meng (@mengk20):

i'm really excited about our Docent roadmap :) we're developing:

- open protocols, schemas, and interfaces for interpreting AI agent traces
- automated systems that can propose and verify general hypotheses about model behaviors, using eval results

come work with us! roles 👇

Todor Markov (@todor_m_markov):

Today, 11 other former OpenAI employees and I filed an amicus brief in the Musk v Altman case. We worked at OpenAI; we know the promises it was founded on and we're worried that in the conversion those promises will be broken. The nonprofit needs to retain control of the

Transluce (@transluceai):

We tested a pre-release version of o3 and found that it frequently fabricates actions it never took, and then elaborately justifies these actions when confronted.

We were surprised, so we dug deeper 🔎🧵(1/)

x.com/OpenAI/status/…
Transluce (@transluceai):

Update: this behavior seems to replicate in o3 deployed in ChatGPT.

Unlike the o3 model we evaluated using the API, o3 in ChatGPT does have access to a Python tool. But ChatGPT still seems to think it’s running code on its own MacBook Pro! 👇(1/)
Daniel Johnson (@_ddjohnson):

Pretty striking follow-up finding from our o3 investigations: in the chain of thought summary, o3 plans to tell the truth — but then it makes something up anyway!
Hadas Orgad (@orgadhadas):

Position papers wanted!

For the First Workshop on Actionable Interpretability, we're looking for diverse perspectives on the state of the field. Should certain areas of interpretability research be developed further? Are there key metrics we should prioritize? Or do you have >>
Harvard University (@harvard):

"Moments ago, we filed a lawsuit to halt the funding freeze because it is unlawful and beyond the government’s authority." - President Alan Garber hrvd.me/Complain421t

Transluce (@transluceai):

We're flying to Singapore for #ICLR2025! ✈️ 

Want to chat with Neil Chowdhury, Jacob Steinhardt and Sarah Schwettmann about Transluce? We're also hiring for several roles in research & product.

Share your contact info on this form and we'll be in touch 👇
forms.gle/4EHLvYnMfdyrV5…