Johannes Treutlein (@j_treutlein) 's Twitter Profile
Johannes Treutlein

@j_treutlein

AI alignment stress-testing research @AnthropicAI. On leave from my CS PhD at UC Berkeley, @CHAI_Berkeley. Opinions my own.

ID: 1500908827608236032

linkhttp://johannestreutlein.com calendar_today07-03-2022 18:59:04

45 Tweet

272 Followers

151 Following

Samuel Marks (@saprmarks) 's Twitter Profile Photo

Idea: make sure AIs never learn dangerous knowledge by censoring their training data Problem: AIs might still infer censored knowledge by "connecting the dots" between individually benign training documents! Johannes Treutlein and Dami Choi formalize this phenomenon as "inductive

Idea: make sure AIs never learn dangerous knowledge by censoring their training data
Problem: AIs might still infer censored knowledge by "connecting the dots" between individually benign training documents!

<a href="/j_treutlein/">Johannes Treutlein</a> and <a href="/damichoi95/">Dami Choi</a> formalize this phenomenon as "inductive
Geoffrey Hinton (@geoffreyhinton) 's Twitter Profile Photo

I was happy to add my name to this list of employees and alumni of AI companies. These signatories have better insight than almost anyone else into what is coming next with AI and we should heed their warnings.

Owain Evans (@owainevans_uk) 's Twitter Profile Photo

New paper: Are LLMs capable of introspection, i.e. special access to their own inner states? Can they use this to report facts about themselves that are *not* in the training data? Yes — in simple tasks at least! This has implications for interpretability + moral status of AI 🧵

New paper:
Are LLMs capable of introspection, i.e. special access to their own inner states?
Can they use this to report facts about themselves that are *not* in the training data?
Yes — in simple tasks at least! This has implications for interpretability + moral status of AI 🧵
Transluce (@transluceai) 's Twitter Profile Photo

Announcing Transluce, a nonprofit research lab building open source, scalable technology for understanding AI systems and steering them in the public interest. Read a letter from the co-founders Jacob Steinhardt and Sarah Schwettmann: transluce.org/introducing-tr…

Dami Choi (@damichoi95) 's Twitter Profile Photo

How do we explain the activation patterns of neurons in language models like Llama? I'm excited to share work that we did at Transluce to inexpensively generate high-quality neuron descriptions at scale!

How do we explain the activation patterns of neurons in language models like Llama?

I'm excited to share work that we did at <a href="/TransluceAI/">Transluce</a> to inexpensively generate high-quality neuron descriptions at scale!
Luke Bailey (@lukebailey181) 's Twitter Profile Photo

Can interpretability help defend LLMs? We find we can reshape activations while preserving a model’s behavior. This lets us attack latent-space defenses, from SAEs and probes to Circuit Breakers. We can attack so precisely that we make a harmfulness probe output this QR code. 🧵

Anthropic (@anthropicai) 's Twitter Profile Photo

New Anthropic research: Alignment faking in large language models. In a series of experiments with Redwood Research, we found that Claude often pretends to have different views during training, while actually maintaining its original preferences.

New Anthropic research: Alignment faking in large language models.

In a series of experiments with Redwood Research, we found that Claude often pretends to have different views during training, while actually maintaining its original preferences.
Sam Bowman (@sleepinyourhat) 's Twitter Profile Photo

My team is hiring researchers! I’m primarily interested in candidates who have (i) several years of experience doing excellent work as a SWE or RE, (ii) who have substantial research experience of some form, and (iii) who are familiar with modern ML and the AGI alignment

Owain Evans (@owainevans_uk) 's Twitter Profile Photo

New paper: Reasoning models like DeepSeek  R1 surpass typical LLMs on many tasks. Do they also provide more faithful explanations? Testing on a benchmark, we find reasoning models are much more faithful. It seems this isn't due to specialized training but arises from RL🧵

New paper:
Reasoning models like DeepSeek  R1 surpass typical LLMs on many tasks.
Do they also provide more faithful explanations?
Testing on a benchmark,  we find reasoning models are much more faithful.
It seems this isn't due to specialized training but arises from RL🧵
Jan Leike (@janleike) 's Twitter Profile Photo

Could we spot a misaligned model in the wild? To find out, we trained a model with hidden misalignments and asked other researchers to uncover them in a blind experiment. 3/4 teams succeeded, 1 of them after only 90 min

Samuel Marks (@saprmarks) 's Twitter Profile Photo

New paper with Johannes Treutlein , Evan Hubinger , and many other coauthors! We train a model with a hidden misaligned objective and use it to run an auditing game: Can other teams of researchers uncover the model’s objective? x.com/AnthropicAI/st…

Trenton Bricken (@trentonbricken) 's Twitter Profile Photo

Jack Lindsey won't say hit himself so I will -- interp was given 3 days to maybe solve what was wrong with this mystery model... **Jack cracked it in 90 minutes!!!** He was visiting the East coast at the time and solved it so fast that we were able to run a second team of

rowan (@rowankwang) 's Twitter Profile Photo

New Anthropic Alignment Science blog post: Modifying LLM Beliefs with Synthetic Document Finetuning We study a technique for systematically modifying what AIs believe. If possible, this would be a powerful new affordance for AI safety research.

New Anthropic Alignment Science blog post: Modifying LLM Beliefs with Synthetic Document Finetuning

We study a technique for systematically modifying what AIs believe.

If possible, this would be a powerful new affordance for AI safety research.