Johannes Treutlein (@j_treutlein) Twitter Tweets • TwiCopy

Johannes Treutlein

@j_treutlein

+ Follow

AI alignment stress-testing research @AnthropicAI. On leave from my CS PhD at UC Berkeley, @CHAI_Berkeley. Opinions my own.

ID: 1500908827608236032

linkhttp://johannestreutlein.com calendar_today07-03-2022 18:59:04

45 Tweet

272 Followers

151 Following

Samuel Marks

@saprmarks

a year ago

Idea: make sure AIs never learn dangerous knowledge by censoring their training data Problem: AIs might still infer censored knowledge by "connecting the dots" between individually benign training documents! Johannes Treutlein and Dami Choi formalize this phenomenon as "inductive

thumb_up_off_alt70

chat_bubble_outline1

repeat6

shareShare

Geoffrey Hinton

@geoffreyhinton

a year ago

I was happy to add my name to this list of employees and alumni of AI companies. These signatories have better insight than almost anyone else into what is coming next with AI and we should heed their warnings.

thumb_up_off_alt1,1K

chat_bubble_outline272

repeat233

shareShare

Owain Evans

@owainevans_uk

a year ago

New paper: Are LLMs capable of introspection, i.e. special access to their own inner states? Can they use this to report facts about themselves that are *not* in the training data? Yes — in simple tasks at least! This has implications for interpretability + moral status of AI 🧵

thumb_up_off_alt533

chat_bubble_outline25

repeat83

shareShare

Transluce

@transluceai

a year ago

Announcing Transluce, a nonprofit research lab building open source, scalable technology for understanding AI systems and steering them in the public interest. Read a letter from the co-founders Jacob Steinhardt and Sarah Schwettmann: transluce.org/introducing-tr…

thumb_up_off_alt703

chat_bubble_outline35

repeat148

shareShare

Dami Choi

@damichoi95

a year ago

How do we explain the activation patterns of neurons in language models like Llama? I'm excited to share work that we did at Transluce to inexpensively generate high-quality neuron descriptions at scale!

How do we explain the activation patterns of neurons in language models like Llama?

I'm excited to share work that we did at <a href="/TransluceAI/">Transluce</a> to inexpensively generate high-quality neuron descriptions at scale!

thumb_up_off_alt132

chat_bubble_outline2

repeat26

shareShare

Luke Bailey

@lukebailey181

a year ago

Can interpretability help defend LLMs? We find we can reshape activations while preserving a model’s behavior. This lets us attack latent-space defenses, from SAEs and probes to Circuit Breakers. We can attack so precisely that we make a harmfulness probe output this QR code. 🧵

thumb_up_off_alt366

chat_bubble_outline11

repeat83

shareShare

Caspar Oesterheld

@c_oesterheld

10 months ago

How do LLMs reason about playing games against copies of themselves? 🪞We made the first LLM decision theory benchmark to find out. 🧵1/10

thumb_up_off_alt101

chat_bubble_outline2

repeat19

shareShare

Anthropic

@anthropicai

10 months ago

New Anthropic research: Alignment faking in large language models. In a series of experiments with Redwood Research, we found that Claude often pretends to have different views during training, while actually maintaining its original preferences.

thumb_up_off_alt4,4K

chat_bubble_outline212

repeat727

shareShare

Sam Bowman

@sleepinyourhat

9 months ago

My team is hiring researchers! I’m primarily interested in candidates who have (i) several years of experience doing excellent work as a SWE or RE, (ii) who have substantial research experience of some form, and (iii) who are familiar with modern ML and the AGI alignment

thumb_up_off_alt300

chat_bubble_outline7

repeat37

shareShare

Owain Evans

@owainevans_uk

8 months ago

New paper: Reasoning models like DeepSeek  R1 surpass typical LLMs on many tasks. Do they also provide more faithful explanations? Testing on a benchmark, we find reasoning models are much more faithful. It seems this isn't due to specialized training but arises from RL🧵

thumb_up_off_alt452

chat_bubble_outline6

repeat78

shareShare

Jan Leike

@janleike

8 months ago

Could we spot a misaligned model in the wild? To find out, we trained a model with hidden misalignments and asked other researchers to uncover them in a blind experiment. 3/4 teams succeeded, 1 of them after only 90 min

thumb_up_off_alt446

chat_bubble_outline25

repeat40

shareShare

Samuel Marks

@saprmarks

8 months ago

New paper with Johannes Treutlein , Evan Hubinger , and many other coauthors! We train a model with a hidden misaligned objective and use it to run an auditing game: Can other teams of researchers uncover the model’s objective? x.com/AnthropicAI/st…

thumb_up_off_alt124

chat_bubble_outline6

repeat15

shareShare

Trenton Bricken

@trentonbricken

8 months ago

Jack Lindsey won't say hit himself so I will -- interp was given 3 days to maybe solve what was wrong with this mystery model... **Jack cracked it in 90 minutes!!!** He was visiting the East coast at the time and solved it so fast that we were able to run a second team of

thumb_up_off_alt56

chat_bubble_outline3

repeat6

shareShare

rowan

@rowankwang

6 months ago

New Anthropic Alignment Science blog post: Modifying LLM Beliefs with Synthetic Document Finetuning We study a technique for systematically modifying what AIs believe. If possible, this would be a powerful new affordance for AI safety research.

thumb_up_off_alt347

chat_bubble_outline14

repeat48

shareShare

Gray Camp

@graycamplab

6 months ago

Learned a lot working with Jonas (josch1@bsky) Fabian Theis @dominik1klein Daniil Bobrovskiy Treutlein lab on CellFlow: generative single-cell phenotype modeling with flow matching. Virtual protocol screening is particularly innovative! #organoids Institute of Human Biology Helmholtz Munich | @HelmholtzMunich ETH Zürich

Learned a lot working with <a href="/josch_f/">Jonas (josch1@bsky)</a> <a href="/fabian_theis/">Fabian Theis</a> @dominik1klein <a href="/dbobrovskiy_/">Daniil Bobrovskiy</a> <a href="/TreutleinLab/">Treutlein lab</a> on CellFlow: generative single-cell phenotype modeling with flow matching. Virtual protocol screening is particularly innovative! #organoids <a href="/IHB_Research/">Institute of Human Biology</a> <a href="/HelmholtzMunich/">Helmholtz Munich | @HelmholtzMunich</a> <a href="/ETH/">ETH Zürich</a>

thumb_up_off_alt147

chat_bubble_outline1

repeat27

shareShare