Tomek Korbak (@tomekkorbak) Twitter Tweets • TwiCopy

Tomek Korbak

@tomekkorbak

senior research scientist @AISecurityInst | previously @AnthropicAI @nyuniversity @SussexUni

ID: 871354083151667200

Link: http://tomekkorbak.com

Joined: 04-06-2017 13:12:57

1.1K Tweets

2.2K Followers

511 Following

Tomek Korbak

@tomekkorbak

4 months ago

What if AI safety was as simple as reading what models write when they think?

Likes: 4 · Replies: 0 · Retweets: 0

Miles Turpin

New @Scale_AI paper! 🌟 LLMs trained with RL can exploit reward hacks but not mention this in their CoT. We introduce verbalization fine-tuning (VFT), teaching models to say when they're reward hacking, dramatically reducing the rate of undetected hacks (6% vs. baseline of 88%).

Likes: 217 · Replies: 7 · Retweets: 36

Bowen Baker

@bobabowen

4 months ago

I am grateful to have worked closely with Tomek Korbak, Mikita Balesni 🇺🇦, Rohin Shah and Vlad Mikulik on this paper, and I am very excited that researchers across many prominent AI institutions collaborated with us and came to consensus around this important direction.

Likes: 28 · Replies: 0 · Retweets: 3

Neel Nanda

@neelnanda5

4 months ago

It was great to be part of this statement. I wholeheartedly agree. It is a wild lucky coincidence that models often express dangerous intentions aloud, and it would be foolish to waste this opportunity. It is crucial to keep chain of thought monitorable as long as possible

Likes: 117 · Replies: 4 · Retweets: 9

Wojciech Zaremba

@woj_zaremba

4 months ago

When models start reasoning step-by-step, we suddenly get a huge safety gift: a window into their thought process. We could easily lose this if we're not careful. We're publishing a paper urging frontier labs: please don't train away this monitorability. Authored and endorsed

Likes: 194 · Replies: 8 · Retweets: 23

Daniel Kokotajlo

@dkokotajlo

4 months ago

I'm very happy to see this happen. I think that we're in a vastly better position to solve the alignment problem if we can see what our AIs are thinking, and I think that we sorta mostly can right now, but that by default in the future companies will move away from this paradigm

Likes: 175 · Replies: 7 · Retweets: 12

Owain Evans

@owainevans_uk

4 months ago

New position paper on Chain of Thought monitoring that I'm excited to be a (small) part of. This is related to our recent work showing that emergently misaligned models sometimes articulate their misaligned plans in their CoT. x.com/OwainEvans_UK/…

Likes: 40 · Replies: 1 · Retweets: 5

Gary Marcus

@garymarcus

4 months ago

truly fascinating win for neurosymbolic AI, raising deep questions about the evolution of human cognition. long chains of cognition must be translated into words [symbols!] - and not just transit through points in embedding space. incredibly interesting.

Likes: 74 · Replies: 5 · Retweets: 15

Jason Wei

@_jasonwei

4 months ago

Becoming an RL diehard in the past year and thinking about RL for most of my waking hours inadvertently taught me an important lesson about how to live my own life. One of the big concepts in RL is that you always want to be “on-policy”: instead of mimicking other people’s

Likes: 1.1K · Replies: 74 · Retweets: 129

AI Security Institute

@aisecurityinst

4 months ago

The monitorability of chain of thought is an exciting opportunity for AI safety. But as models get more powerful, it could require ongoing, active commitments to preserve. We’re excited to collaborate with Apollo Research and many authors from frontier labs on this position paper.

Likes: 18 · Replies: 0 · Retweets: 2

Tomek Korbak

@tomekkorbak

4 months ago

Likes: 39 · Replies: 1 · Retweets: 3

Toby Ord

@tobyordoxford

4 months ago

The fact that frontier AI agents subvocalise their plans in English is an absolute gift for AI safety — a quirk of the technology development which may have done more to protect us from misaligned AGI than any technique we've deliberately developed. Don't squander this gift.

Likes: 108 · Replies: 6 · Retweets: 11

Asa Cooper Stickland

@asacoopstick

4 months ago

The reasoning traces feel very encoded/RL-y!

Likes: 19 · Replies: 1 · Retweets: 1
