Rohin Shah (@rohinmshah)'s Twitter Profile
Rohin Shah

@rohinmshah

AGI Safety & Alignment @ Google DeepMind

ID: 915008528246435840

Link: http://rohinshah.com/ | Joined: 03-10-2017 00:20:07

340 Tweets

7.7K Followers

91 Following

Rohin Shah (@rohinmshah):

I've been super impressed at the speed with which our interpretability team gets stuff done. Their previous paper (also SotA at the time) was < 3 months ago. And they've also trained (and will open source) a full suite of SAEs on Gemma 2 9B! x.com/NeelNanda5/sta…

Rohin Shah (@rohinmshah):

I've heard of several compute-constrained safety projects that want to use SAEs to study phenomena that only emerge at scale -- think hallucinations, jailbreaks, RLHF effects. I hope Gemma Scope will accelerate them, and inspire further ambitious research! x.com/NeelNanda5/sta…
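As a rough illustration of what a sparse autoencoder (SAE) computes (a minimal sketch, not the Gemma Scope implementation or API): an SAE maps a model activation to a sparse, overcomplete latent and reconstructs the activation from it, so individual latent dimensions can be inspected as candidate features. The widths, initialization, and plain-ReLU activation below are illustrative assumptions.

```python
# Sketch of an SAE forward pass over a single residual-stream activation.
# All names, shapes, and the plain ReLU are assumptions for illustration only.
import numpy as np

d_model, d_sae = 3584, 16384          # activation width, SAE latent width (assumed)
rng = np.random.default_rng(0)
W_enc = rng.normal(size=(d_model, d_sae)) * 0.01
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model)) * 0.01
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode an activation into a sparse latent, then reconstruct it."""
    latent = np.maximum(x @ W_enc + b_enc, 0.0)   # ReLU keeps only a few latents active
    recon = latent @ W_dec + b_dec
    return latent, recon

x = rng.normal(size=d_model)          # stand-in for one model activation
latent, recon = sae_forward(x)
# Training minimizes ||x - recon||^2 plus a sparsity penalty on `latent`,
# which is what pushes each latent dimension toward an interpretable feature.
```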

Anca Dragan (@ancadianadragan):

So freaking proud of the AGI safety & alignment team -- read here a retrospective of the work over the past 1.5 years across frontier safety, oversight, interpretability, and more. Onwards! alignmentforum.org/posts/79BPxvSs…

Allan Dafoe (@allandafoe):

We are hiring! Google DeepMind's Frontier Safety and Governance team is dedicated to mitigating frontier AI risks; we work closely with technical safety, policy, responsibility, security, and GDM leadership. Please encourage great people to apply! 1/ boards.greenhouse.io/deepmind/jobs/…

Allan Dafoe (@allandafoe):

Update: we’re hiring for multiple positions! Join GDM to shape the frontier of AI safety, governance, and strategy. Priority areas: forecasting AI, geopolitics and AGI efforts, FSF risk management, agents, global governance. More details below: 🧵

Rohin Shah (@rohinmshah):

Some nice lessons from our Rater Assist team about how to leverage AI assistance for rating tasks! x.com/shubadubadub/s…

Rohin Shah (@rohinmshah):

New AI safety paper! Introduces MONA, which avoids incentivizing alien long-term plans. This also implies “no long-term RL-induced steganography” (see the loans environment). So you can also think of this as a project about legible chains of thought. x.com/davlindner/sta…

Rohin Shah (@rohinmshah):

Big call for proposals from Open Phil! I'd love to see great safety research come out of this that we can import at GDM. I could wish there was more of a focus on *building aligned AI systems* but the areas they do list are important! x.com/MaxNadeau_/sta…

Rohin Shah (@rohinmshah):

New release! Great for a short, high-level overview of a variety of different areas within AGI safety that we're excited about. x.com/vkrakovna/stat…

Tom Everitt (@tom4everitt):

What if LLMs are sometimes capable of doing a task but don't try hard enough to do it? In a new paper, we use subtasks to assess capabilities. Perhaps surprisingly, LLMs often fail to fully employ their capabilities, i.e. they are not fully *goal-directed* 🧵

Anca Dragan (@ancadianadragan):

Per our Frontier Safety Framework, we continue to test our models for critical capabilities. Here’s the updated model card for Gemini 2.5 Pro with frontier safety evaluations + an explanation of how our safety buffer / alert threshold approach applies to 2.0, 2.5, and what’s coming.