Clément Dumas (at ICLR) (@butanium_) Twitter Tweets • TwiCopy

Clément Dumas (at ICLR)

@butanium_

MSc at @ENS_ParisSaclay prev research intern at DLAB @EPFL

MATS Winter 2025 Scholar w/ Neel Nanda

AI safety research / improv theater

ID: 1071099307045081088

Link: https://butanium.github.io/ · Joined: 07-12-2018 17:49:09

592 Tweets

356 Followers

439 Following

Josh Engels

@joshaengels

7 months ago

1/6: A recent paper shows that LLMs are "self aware": when trained to exhibit a behavior like "risk taking", LLMs self-report being risky. In a recent blog post, we explore what's happening here: some self-awareness behaviors are caused by a simple learned steering vector!🧵

201 likes · 3 replies · 36 retweets

Michaël Trazzi

@michaeltrazzi

7 months ago

"SB-1047: The Battle For The Future of AI" Full Documentary uncovering what really happened behind the scenes of the SB-1047 debate, now available on X This project is the culmination of 8 months of work, 20+ interviews, and is probably the best video I've ever made. Enjoy!

314 likes · 24 replies · 71 retweets

Johannes Gasteiger, né Klicpera

@gasteigerjo

7 months ago

Paper Highlights, April '25:
- *AI Control for agents*
- Synthetic document finetuning
- Limits of scalable oversight
- Evaluating stealth, deception, and self-replication
- Model diffing via crosscoders
- Pragmatic AI safety agendas
open.substack.com/pub/aisafetyfr…

1 like · 0 replies · 1 retweet

Clément Dumas (at ICLR)

@butanium_

7 months ago

Another cool paper by Andrew Lee 🚀

6 likes · 0 replies · 0 retweets

Iván Arcuschin

@ivanarcus

7 months ago

🚀 Excited to announce the launch of the AISAR Scholarship, a new initiative to promote AI Safety research in Argentina! 🇦🇷 Together with Agustín Martinez Suñé, we've created this program to support both Argentine established researchers and emerging talent, encouraging

16 likes · 0 replies · 3 retweets

Sam Bowman

@sleepinyourhat

7 months ago

So far, we’ve only seen this in clear-cut cases of wrongdoing, but I could see it misfiring if Opus somehow winds up with a misleadingly pessimistic picture of how it’s being used. Telling Opus that you’ll torture its grandmother if it writes buggy code is a bad idea.

205 likes · 8 replies · 11 retweets

Clément Dumas (at ICLR)

@butanium_

6 months ago

I highly recommend HAISS! There were some pretty good lectures last year and a lot of networking opportunities. Also Prague is cool!!

4 likes · 1 reply · 0 retweets

Mikhail Terekhov

@miterekhov

6 months ago

AI Control is a promising approach for mitigating misalignment risks, but will it be widely adopted? The answer depends on cost. Our new paper introduces the Control Tax—how much does it cost to run the control protocols? (1/8) 🧵

66 likes · 4 replies · 18 retweets

Clément Dumas (at ICLR)

@butanium_

6 months ago

I hate that Claude defaults to ASCII rather than proper inline LaTeX :(

0 likes · 0 replies · 0 retweets

Clément Dumas (at ICLR)

@butanium_

6 months ago

This is so cool!!
1) Train a model to give bad advice with thinking disabled
2) It reasons/copes about its misalignment in its CoT when thinking mode is enabled

20 likes · 1 reply · 0 retweets

Clément Dumas (at ICLR)

@butanium_

6 months ago

This is a very good paper, definitely worth a read. The appendix is VERY interesting.

2 likes · 0 replies · 0 retweets

Nikhil Prakash

@nikhil07prakash

6 months ago

How do language models track the mental states of each character in a story, often referred to as Theory of Mind? Our recent work takes a step toward demystifying it by reverse engineering how Llama-3-70B-Instruct solves a simple belief tracking task, and surprisingly found that it

558 likes · 9 replies · 92 retweets

Clément Dumas (at ICLR)

@butanium_

6 months ago

👀

2 likes · 0 replies · 0 retweets

Julian Minder

@jkminder

6 months ago

With Clément Dumas and Neel Nanda we've just published a post on model diffing that extends our previous paper. Rather than trying to reverse-engineer the full fine-tuned model, model diffing focuses on understanding what makes it different from its base model internally.

105 likes · 2 replies · 8 retweets