Noah Y. Siegel (@noahysiegel)'s Twitter Profile
Noah Y. Siegel

@noahysiegel

Research Engineer @GoogleDeepMind. EA, vegan, giving 10% of my income to effective animal welfare charities. Let's make AGI go well for all sentient beings!

ID: 4549583660

Joined: 13-12-2015 18:51:32

23 Tweets

170 Followers

115 Following

Zac Kenton (@zackenton1)'s Twitter Profile Photo

Eventually, humans will need to supervise superhuman AI - but how? Can we study it now? We don't have superhuman AI, but we do have LLMs. We study protocols where a weaker LLM uses stronger ones to find better answers than it knows itself. Does this work? It’s complicated: 🧵👇

Anca Dragan (@ancadianadragan)'s Twitter Profile Photo

So freaking proud of the AGI safety&alignment team -- read here a retrospective of the work over the past 1.5 years across frontier safety, oversight, interpretability, and more. Onwards! alignmentforum.org/posts/79BPxvSs…

Yoshua Bengio (@yoshua_bengio)'s Twitter Profile Photo

Employees of frontier AI labs are in a unique position to understand the potential impact of the most advanced AI models and their perspectives on this matter must be taken into account. I strongly encourage Governor Gavin Newsom to sign SB 1047 into law. calltolead.org

David Lindner (@davlindner)'s Twitter Profile Photo

New Google DeepMind safety paper! LLM agents are coming – how do we stop them finding complex plans to hack the reward? Our method, MONA, prevents many such hacks, *even if* humans are unable to detect them! Inspired by myopic optimization but better performance – details in 🧵

Zac Kenton (@zackenton1)'s Twitter Profile Photo

We're hiring for our Google DeepMind AGI Safety & Alignment and Gemini Safety teams. Locations: London, NYC, Mountain View, SF. Join us to help build safe AGI. Research Engineer boards.greenhouse.io/deepmind/jobs/…… Research Scientist boards.greenhouse.io/deepmind/jobs/…

Rohin Shah (@rohinmshah)'s Twitter Profile Photo

We're hiring! Join an elite team that sets an AGI safety approach for all of Google -- both through development and implementation of the Frontier Safety Framework (FSF), and through research that enables a future stronger FSF.

Arthur Conmy (@arthurconmy)'s Twitter Profile Photo

We are hiring Applied Interpretability researchers on the GDM Mech Interp Team!🧵 If interpretability is ever going to be useful, we need it to be applied at the frontier. Come work with Neel Nanda, the Google DeepMind AGI Safety team, and me: apply by 28th February as a

Fazl Barez (@fazlbarez)'s Twitter Profile Photo

Excited to share our paper: "Chain-of-Thought Is Not Explainability"! We unpack a critical misconception in AI: models explaining their Chain-of-Thought (CoT) steps aren't necessarily revealing their true reasoning. Spoiler: transparency of CoT can be an illusion. (1/9) 🧵

Chloe Li (@clippocampus)'s Twitter Profile Photo

Can LLMs covertly sandbag on capability evaluations against CoT monitoring? Find out on Saturday at the Technical AI Governance workshop at ICML 2025! I’ll be giving a talk at 10:50 about work done on this by me, Mary Phuong and Noah Y. Siegel. Swing by and chat to me in person or on DMs

Noah Y. Siegel (@noahysiegel)'s Twitter Profile Photo

Excited to be a SPAR mentor this Fall, come work with me on figuring out how to measure explanatory faithfulness for LLMs!