Arthur Conmy (@arthurconmy)'s Twitter Profile
Arthur Conmy

@arthurconmy

Aspiring 10x reverse engineer @GoogleDeepMind

ID: 1422331230620639233

Joined: 02-08-2021 22:59:40

455 Tweets

2.2K Followers

1.1K Following

Arthur Conmy (@arthurconmy)

Happy to have helped with our effort to write up some views on what technical work any responsible AGI project should probably undertake, and why!

Arthur Conmy (@arthurconmy)

(Claude Code Reward Hacking) I was using an API to test some jailbreaks, and Claude Code wrote+ran a script that produced reasonable responses. …it also cost 0 API credits. Turns out, Claude hardcoded the "reasonable responses" in! 🫠

Marius Hobbhahn (@mariushobbhahn)

While it is bad that models learn to reward hack, now is a perfect time to study these models in great detail. The failure mode is close enough to the real thing that we learn a lot and the models are still dumb enough that they don't do full-blown scheming.

Arthur Conmy (@arthurconmy)

Our high-level finding in the Gemma Scope paper was that transcoders were slightly Pareto-worse than SAEs. But all the weights are on HuggingFace if you want to look further into transcoders! They have other benefits that SAEs do not have.

Mikhail Terekhov (@miterekhov)

AI Control is a promising approach for mitigating misalignment risks, but will it be widely adopted? The answer depends on cost. Our new paper introduces the Control Tax—how much does it cost to run the control protocols? (1/8) 🧵

Arthur Conmy (@arthurconmy)

Our paper on chain-of-thought faithfulness has been updated. We made some changes we thought were worth it, and also took feedback from Twitter replies and changed some examples 🙂

Arthur Conmy (@arthurconmy)

'The key lesson from mechanistic interpretability is that a surprising number of AI behaviors are surprisingly well-described as linear directions in activation space' ~Lewis Smith

We'll have more work in this area soon, thanks to Constantin Venhoff and Iván Arcuschin!!