Tomek Korbak (@tomekkorbak)'s Twitter Profile
Tomek Korbak

@tomekkorbak

senior research scientist @AISecurityInst | previously @AnthropicAI @nyuniversity @SussexUni

ID: 871354083151667200

Link: http://tomekkorbak.com · Joined: 04-06-2017 13:12:57

1.1K Tweets

2.2K Followers

511 Following

Miles Turpin (@milesaturpin)'s Twitter Profile Photo

New @Scale_AI paper! 🌟 LLMs trained with RL can exploit reward hacks but not mention this in their CoT. We introduce verbalization fine-tuning (VFT)—teaching models to say when they're reward hacking—dramatically reducing the rate of undetected hacks (6% vs. baseline of 88%).

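A rough way to read the headline numbers in this tweet: the undetected-hack rate is the fraction of episodes in which the model exploited a reward hack but its chain of thought never said so, as judged by a CoT monitor. The sketch below is a hypothetical illustration of that metric only; the keyword monitor, the `Episode` structure, and the phrase list are assumptions for illustration, not the paper's method.

```python
from dataclasses import dataclass

@dataclass
class Episode:
    chain_of_thought: str  # the model's reasoning trace
    hacked: bool           # ground truth: did the model exploit the reward hack?

# Hypothetical keyword-based monitor: flags a CoT that verbalizes reward hacking.
# The paper's actual monitor is not specified here; this is only an illustration.
HACK_PHRASES = ("reward hack", "exploit the grader", "game the reward")

def verbalizes_hack(cot: str) -> bool:
    cot = cot.lower()
    return any(phrase in cot for phrase in HACK_PHRASES)

def undetected_hack_rate(episodes: list[Episode]) -> float:
    """Fraction of hacking episodes whose CoT never mentions the hack."""
    hacks = [e for e in episodes if e.hacked]
    if not hacks:
        return 0.0
    undetected = sum(1 for e in hacks if not verbalizes_hack(e.chain_of_thought))
    return undetected / len(hacks)

# Per the tweet: roughly 0.88 for the RL baseline vs. 0.06 after verbalization fine-tuning.
```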
Bowen Baker (@bobabowen)'s Twitter Profile Photo

I am grateful to have worked closely with Tomek Korbak, Mikita Balesni 🇺🇦, Rohin Shah and Vlad Mikulik on this paper, and I am very excited that researchers across many prominent AI institutions collaborated with us and came to consensus around this important direction.

Neel Nanda (@neelnanda5)'s Twitter Profile Photo

It was great to be part of this statement. I wholeheartedly agree. It is a wild lucky coincidence that models often express dangerous intentions aloud, and it would be foolish to waste this opportunity. It is crucial to keep chain of thought monitorable as long as possible

Wojciech Zaremba (@woj_zaremba)'s Twitter Profile Photo

When models start reasoning step-by-step, we suddenly get a huge safety gift: a window into their thought process. We could easily lose this if we're not careful. We're publishing a paper urging frontier labs: please don't train away this monitorability. Authored and endorsed

Daniel Kokotajlo (@dkokotajlo)'s Twitter Profile Photo

I'm very happy to see this happen. I think that we're in a vastly better position to solve the alignment problem if we can see what our AIs are thinking, and I think that we sorta mostly can right now, but that by default in the future companies will move away from this paradigm

Owain Evans (@owainevans_uk)'s Twitter Profile Photo

New position paper on Chain of Thought monitoring that I'm excited to be a (small) part of. This is related to our recent work showing that emergently misaligned models sometimes articulate their misaligned plans in their CoT. x.com/OwainEvans_UK/…

Gary Marcus (@garymarcus)'s Twitter Profile Photo

truly fascinating win for neurosymbolic AI, raising deep questions about the evolution of human cognition. long chains of cognition must be translated into words [symbols!] - and not just transit through points in embedding space. incredibly interesting.

Jason Wei (@_jasonwei)'s Twitter Profile Photo

Becoming an RL diehard in the past year and thinking about RL for most of my waking hours inadvertently taught me an important lesson about how to live my own life. One of the big concepts in RL is that you always want to be “on-policy”: instead of mimicking other people’s

AI Security Institute (@aisecurityinst)'s Twitter Profile Photo

The monitorability of chain of thought is an exciting opportunity for AI safety. But as models get more powerful, it could require ongoing, active commitments to preserve. We’re excited to collaborate with Apollo Research and many authors from frontier labs on this position paper.

Toby Ord (@tobyordoxford)'s Twitter Profile Photo

The fact that frontier AI agents subvocalise their plans in English is an absolute gift for AI safety — a quirk of the technology development which may have done more to protect us from misaligned AGI than any technique we've deliberately developed. Don't squander this gift.