Clément Dumas (at ICLR) (@butanium_)'s Twitter Profile
Clément Dumas (at ICLR)

@butanium_

MSc at @ENS_ParisSaclay prev research intern at DLAB @EPFL

MATS Winter 2025 Scholar w/ Neel Nanda

AI safety research / improv theater

ID: 1071099307045081088

Link: https://butanium.github.io/
Joined: 07-12-2018 17:49:09

592 Tweets

356 Followers

439 Following

Josh Engels (@joshaengels)

1/6: A recent paper shows that LLMs are "self-aware": when trained to exhibit a behavior like "risk taking", LLMs self-report being risky. In a recent blog post, we explore what's happening here: some self-awareness behaviors are caused by a simple learned steering vector!🧵

Michaël Trazzi (@michaeltrazzi)

"SB-1047: The Battle For The Future of AI": the full documentary uncovering what really happened behind the scenes of the SB-1047 debate is now available on X. This project is the culmination of 8 months of work and 20+ interviews, and is probably the best video I've ever made. Enjoy!

Johannes Gasteiger, né Klicpera (@gasteigerjo)

Paper Highlights, April '25:

- *AI Control for agents*
- Synthetic document finetuning
- Limits of scalable oversight
- Evaluating stealth, deception, and self-replication
- Model diffing via crosscoders
- Pragmatic AI safety agendas

open.substack.com/pub/aisafetyfr…

Iván Arcuschin (@ivanarcus)

🚀 Excited to announce the launch of the AISAR Scholarship, a new initiative to promote AI Safety research in Argentina! 🇦🇷

Together with Agustín Martinez Suñé, we've created this program to support both Argentine established researchers and emerging talent, encouraging

Sam Bowman (@sleepinyourhat)

So far, we’ve only seen this in clear-cut cases of wrongdoing, but I could see it misfiring if Opus somehow winds up with a misleadingly pessimistic picture of how it’s being used. Telling Opus that you’ll torture its grandmother if it writes buggy code is a bad idea.

Clément Dumas (at ICLR) (@butanium_)

I highly recommend HAISS! There were some pretty good lectures last year and a lot of networking opportunities. Also, Prague is cool!!

Mikhail Terekhov (@miterekhov)

AI Control is a promising approach for mitigating misalignment risks, but will it be widely adopted? The answer depends on cost. Our new paper introduces the Control Tax: how much does it cost to run the control protocols? (1/8) 🧵

Clément Dumas (at ICLR) (@butanium_)

This is so cool!!
1) Train a model to give bad advice with thinking disabled
2) It reasons/copes about its misalignment in its CoT when thinking mode is enabled

Nikhil Prakash (@nikhil07prakash)

How do language models track mental states of each character in a story, often referred to as Theory of Mind?

Our recent work takes a step toward demystifying it by reverse engineering how Llama-3-70B-Instruct solves a simple belief tracking task, and surprisingly found that it

Julian Minder (@jkminder)

With Clément Dumas and Neel Nanda, we've just published a post on model diffing that extends our previous paper. Rather than trying to reverse-engineer the full fine-tuned model, model diffing focuses on understanding what makes it different from its base model internally.
