Rohin Shah (@rohinmshah)'s Twitter Profile
Rohin Shah

@rohinmshah

AGI Safety & Alignment @ Google DeepMind

ID: 915008528246435840

Link: http://rohinshah.com/ | Joined: 03-10-2017 00:20:07

340 Tweets

7.7K Followers

91 Following

Rohin Shah (@rohinmshah):

I've been super impressed at the speed with which our interpretability team gets stuff done. Their previous paper (also SotA at the time) was < 3 months ago. And they've also trained (and will open source) a full suite of SAEs on Gemma 2 9B! x.com/NeelNanda5/sta…

Rohin Shah (@rohinmshah):

I've heard of several compute-constrained safety projects that want to use SAEs to study phenomena that only emerge at scale -- think hallucinations, jailbreaks, RLHF effects. I hope Gemma Scope will accelerate them, and inspire further ambitious research! x.com/NeelNanda5/sta…
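As a rough illustration of what a sparse autoencoder (SAE) computes (a minimal sketch, not the Gemma Scope implementation or API): an SAE maps a model activation to a sparse, overcomplete latent and reconstructs the activation from it, so individual latent dimensions can be inspected as candidate features. The widths, initialization, and plain-ReLU activation below are illustrative assumptions.

```python
# Sketch of an SAE forward pass over a single residual-stream activation.
# All names, shapes, and the plain ReLU are assumptions for illustration only.
import numpy as np

d_model, d_sae = 3584, 16384          # activation width, SAE latent width (assumed)
rng = np.random.default_rng(0)
W_enc = rng.normal(size=(d_model, d_sae)) * 0.01
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model)) * 0.01
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode an activation into a sparse latent, then reconstruct it."""
    latent = np.maximum(x @ W_enc + b_enc, 0.0)   # ReLU keeps only a few latents active
    recon = latent @ W_dec + b_dec
    return latent, recon

x = rng.normal(size=d_model)          # stand-in for one model activation
latent, recon = sae_forward(x)
# Training minimizes ||x - recon||^2 plus a sparsity penalty on `latent`,
# which is what pushes each latent dimension toward an interpretable feature.
```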

Anca Dragan (@ancadianadragan):

So freaking proud of the AGI safety & alignment team -- read here a retrospective of the work over the past 1.5 years across frontier safety, oversight, interpretability, and more. Onwards! alignmentforum.org/posts/79BPxvSs…

Allan Dafoe (@allandafoe):

We are hiring! Google DeepMind's Frontier Safety and Governance team is dedicated to mitigating frontier AI risks; we work closely with technical safety, policy, responsibility, security, and GDM leadership. Please encourage great people to apply! 1/ boards.greenhouse.io/deepmind/jobs/…

Allan Dafoe (@allandafoe):

Update: we’re hiring for multiple positions! Join GDM to shape the frontier of AI safety, governance, and strategy. Priority areas: forecasting AI, geopolitics and AGI efforts, FSF risk management, agents, global governance. More details below: 🧵

Rohin Shah (@rohinmshah):

Some nice lessons from our Rater Assist team about how to leverage AI assistance for rating tasks! x.com/shubadubadub/s…

Rohin Shah (@rohinmshah):

New AI safety paper! Introduces MONA, which avoids incentivizing alien long-term plans. This also implies “no long-term RL-induced steganography” (see the loans environment). So you can also think of this as a project about legible chains of thought. x.com/davlindner/sta…

Rohin Shah (@rohinmshah):

Big call for proposals from Open Phil! I'd love to see great safety research come out of this that we can import at GDM. I could wish there was more of a focus on *building aligned AI systems* but the areas they do list are important! x.com/MaxNadeau_/sta…

Rohin Shah (@rohinmshah):

New release! Great for a short, high-level overview of a variety of different areas within AGI safety that we're excited about. x.com/vkrakovna/stat…

Tom Everitt (@tom4everitt):

What if LLMs are sometimes capable of doing a task but don't try hard enough to do it? In a new paper, we use subtasks to assess capabilities. Perhaps surprisingly, LLMs often fail to fully employ their capabilities, i.e. they are not fully *goal-directed* 🧵

Anca Dragan (@ancadianadragan):

Per our Frontier Safety Framework, we continue to test our models for critical capabilities. Here’s the updated model card for Gemini 2.5 Pro with frontier safety evaluations + an explanation of how our safety buffer / alert threshold approach applies to 2.0, 2.5, and what’s coming.