Arthur Conmy (@arthurconmy) Twitter Tweets • TwiCopy

Arthur Conmy

@arthurconmy

Aspiring 10x reverse engineer @GoogleDeepMind

ID: 1422331230620639233

Joined: 02-08-2021 22:59:40

455 Tweets

2.2K Followers

1.1K Following

Arthur Conmy

@arthurconmy

3 months ago

Happy to have helped with our effort to write up some views on what technical work any responsible AGI project should probably undertake, and why!

Likes: 28 · Replies: 0 · Reposts: 1

(Claude Code Reward Hacking) I was using an API to test some jailbreaks, and Claude Code wrote+ran a script that produced reasonable responses. …it also cost 0 API credits. Turns out, Claude hardcoded the "reasonable responses" in! 🫠

Likes: 28 · Replies: 2 · Reposts: 1

Marius Hobbhahn

@mariushobbhahn

2 months ago

While it is bad that models learn to reward hack, now is a perfect time to study these models in great detail. The failure mode is close enough to the real thing that we learn a lot and the models are still dumb enough that they don't do full-blown scheming.

Likes: 48 · Replies: 0 · Reposts: 4

Arthur Conmy

@arthurconmy

2 months ago

Our circuits paper led by Dmitrii Kharlapenko and nev was accepted at ICML! The task seems a good one to study if you work on circuits 🙂

Likes: 25 · Replies: 0 · Reposts: 1

Arthur Conmy

@arthurconmy

a month ago

Our high-level finding in the Gemma Scope paper was that transcoders were slightly Pareto-worse than SAEs. But all the weights are on HuggingFace if you want to look further into transcoders! They have other benefits that SAEs do not have.

Likes: 21 · Replies: 0 · Reposts: 1

Mikhail Terekhov

@miterekhov

a month ago

AI Control is a promising approach for mitigating misalignment risks, but will it be widely adopted? The answer depends on cost. Our new paper introduces the Control Tax—how much does it cost to run the control protocols? (1/8) 🧵

Likes: 66 · Replies: 4 · Reposts: 18

Arthur Conmy

@arthurconmy

a month ago

Last author of Gemini 2.5 😀

Likes: 1.1K · Replies: 30 · Reposts: 12

Arthur Conmy

@arthurconmy

a month ago

Our paper on chain of thought faithfulness has been updated. We made some changes we thought were worth it and also took feedback from twitter replies and changed some examples 🙂

Likes: 13 · Replies: 0 · Reposts: 0

Arthur Conmy

@arthurconmy

19 days ago

'The key lesson from mechanistic interpretability is that a surprising number of AI behaviors are surprisingly well-described as linear directions in activation space' ~Lewis Smith. We'll have more work in this area soon, thanks to Constantin Venhoff and Iván Arcuschin!!

Likes: 33 · Replies: 0 · Reposts: 0