Anca Dragan (@ancadianadragan)'s Twitter Profile
Anca Dragan

@ancadianadragan

sr director of AI safety & alignment at Google DeepMind • associate professor at UC Berkeley EECS • proud mom of an amazing 3yr old

ID: 978823301802819584

Link: http://www.ancadragan.com · Joined: 28-03-2018 02:37:15

308 Tweets

11.11K Followers

185 Following

Vivek Myers (@vivek_myers):

Current robot learning methods are good at imitating tasks seen during training, but struggle to compose behaviors in new ways. When training imitation policies, we found something surprising—using temporally-aligned task representations enabled compositional generalization. 1/

Vivek Myers (@vivek_myers):

What does temporal alignment mean? When training, our policy imitates the human actions that lead to the end goal 𝑔 of a trajectory. Rather than training on the raw goals, we use a representation 𝜓(𝑔) that aligns with the preceding state “successor features” 𝜙(𝑠). 2/

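A rough picture of what that could look like in code (my own sketch in PyTorch, with made-up module names and a contrastive alignment objective standing in for the paper's actual loss): train a state encoder 𝜙 and a goal encoder 𝜓 so that 𝜙(𝑠ₜ) aligns with 𝜓(𝑔) for states that precede goal 𝑔, and condition the imitation policy on 𝜓(𝑔) rather than the raw goal.

```python
# Hypothetical sketch of temporally-aligned goal representations for
# goal-conditioned imitation (not the paper's actual implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

state_dim, goal_dim, act_dim, emb_dim = 32, 32, 8, 64

class Encoder(nn.Module):
    """Small MLP used for both phi(s) (successor features) and psi(g) (goal embedding)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, out_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

phi = Encoder(state_dim, emb_dim)   # phi(s): embedding of the current state
psi = Encoder(goal_dim, emb_dim)    # psi(g): embedding of the trajectory's end goal
policy = nn.Sequential(nn.Linear(state_dim + emb_dim, 256), nn.ReLU(),
                       nn.Linear(256, act_dim))  # pi(a | s, psi(g))

def alignment_loss(states, goals, temperature=0.1):
    # InfoNCE-style stand-in: phi(s_t) should score highest against the psi(g)
    # of its own trajectory and low against goals from other trajectories.
    logits = phi(states) @ psi(goals).T / temperature
    labels = torch.arange(states.shape[0])
    return F.cross_entropy(logits, labels)

def imitation_loss(states, goals, actions):
    # Behavior cloning conditioned on the representation psi(g), not the raw goal g.
    pred = policy(torch.cat([states, psi(goals)], dim=-1))
    return F.mse_loss(pred, actions)

# One training step on a fake batch of (s_t, g, a_t) tuples.
params = list(phi.parameters()) + list(psi.parameters()) + list(policy.parameters())
opt = torch.optim.Adam(params, lr=3e-4)
s, g, a = torch.randn(128, state_dim), torch.randn(128, goal_dim), torch.randn(128, act_dim)
loss = imitation_loss(s, g, a) + alignment_loss(s, g)
opt.zero_grad(); loss.backward(); opt.step()
```
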
Rohin Shah (@rohinmshah):

New release! Great for a short, high-level overview of a variety of different areas within AGI safety that we're excited about. x.com/vkrakovna/stat…

Victoria Krakovna (@vkrakovna):

Learn more about AGI safety in a new short course from the Google DeepMind alignment team! Check out the short 5-minute videos in the thread, or see the full playlist here: youtube.com/playlist?list=…

Anca Dragan (@ancadianadragan):

we're releasing a short course on agi safety & alignment -- hope it helps! thanks team for pulling it together, special thanks to Victoria Krakovna deepmindsafetyresearch.medium.com/introducing-ou…

Pieter Abbeel (@pabbeel):

Founders who were PhD or post-doc in my lab at Berkeley, **largely funded by NSF / DoD grants**, start-up, market cap (collected by OpenAI Deep Research)

Anca Dragan (@ancadianadragan):

The native image generation launch was a lot of work from a safety POV. But I'm so happy we got this functionality out, check this out:

Cassidy Laidlaw (@cassidy_laidlaw):

We built an AI assistant that plays Minecraft with you. Start building a house—it figures out what you’re doing and jumps in to help. This assistant *wasn't* trained with RLHF. Instead, it's powered by *assistance games*, a better path forward for building AI assistants. 🧵

Cassidy Laidlaw (@cassidy_laidlaw):

RLHF is great but it encourages short-term optimization: trying to solve the user's entire problem in a single response. For example, if you ask ChatGPT to "clean up some disk space," it will immediately give you a program to run without asking which files are okay to delete!

Cassidy Laidlaw (@cassidy_laidlaw):

A better assistant would maintain *uncertainty* about its goal and ask clarification questions until it really understood, leading to a better solution. Assistance games can enable this.
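
As a toy illustration (my own sketch, not the paper's implementation; class name, likelihoods, and threshold are made up): an assistant can keep a posterior over candidate goals, update it from the user's messages, and act only once the posterior is concentrated enough, asking a clarifying question otherwise.

```python
# Toy sketch of an assistant that keeps a posterior over candidate user goals
# and asks for clarification when its uncertainty is still high.
import math

class UncertainAssistant:
    def __init__(self, goals):
        self.belief = {g: 1.0 / len(goals) for g in goals}  # uniform prior over goals

    def observe(self, likelihood):
        # likelihood: dict goal -> P(user's message | goal); Bayes update of the belief.
        total = sum(self.belief[g] * likelihood.get(g, 1e-9) for g in self.belief)
        for g in self.belief:
            self.belief[g] = self.belief[g] * likelihood.get(g, 1e-9) / total

    def entropy(self):
        return -sum(p * math.log(p) for p in self.belief.values() if p > 0)

    def act(self, threshold=0.5):
        # Act only when the belief is concentrated; otherwise ask a question.
        if self.entropy() > threshold:
            return "clarify: which of these did you mean? " + ", ".join(self.belief)
        best = max(self.belief, key=self.belief.get)
        return f"act: proceeding with '{best}'"

# Hypothetical "clean up some disk space" interaction from the tweet above.
assistant = UncertainAssistant(["delete caches", "delete old downloads", "delete logs"])
print(assistant.act())   # high entropy -> asks a clarification question
assistant.observe({"delete caches": 0.9, "delete old downloads": 0.05, "delete logs": 0.05})
print(assistant.act())   # belief now concentrated -> acts on the inferred goal
```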

Cassidy Laidlaw (@cassidy_laidlaw):

Unlike RLHF, assistance games explicitly treat the user-assistant interaction as a two player game, where the user knows their goal but the assistant doesn't. AGs model *communication* about the goal from the user to the assistant and *collaboration* between them to achieve it.

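Stated a bit more concretely (a hedged sketch of the generic setup with my own field names, not the paper's formalism): an assistance game is roughly a two-agent decision process whose reward depends on a goal parameter θ that only the human observes, so the assistant must infer θ from the human's behavior while the two act together.

```python
# Hedged sketch of a generic assistance-game interface; field names are mine.
from dataclasses import dataclass
from typing import Any, Callable, Sequence

@dataclass
class AssistanceGame:
    states: Sequence[Any]
    human_actions: Sequence[Any]
    assistant_actions: Sequence[Any]
    goals: Sequence[Any]                            # possible theta; the human knows the true one
    goal_prior: Callable[[Any], float]              # assistant's prior P(theta)
    transition: Callable[[Any, Any, Any], Any]      # s' = T(s, a_human, a_assistant)
    reward: Callable[[Any, Any, Any, Any], float]   # r(s, a_human, a_assistant, theta)

def rollout(game, human_policy, assistant_policy, s, theta, horizon=10):
    """The human acts on theta directly; the assistant only sees the interaction history."""
    history, total = [], 0.0
    for _ in range(horizon):
        a_h = human_policy(s, theta)           # human knows the goal
        a_r = assistant_policy(s, history)     # assistant must infer it from history
        total += game.reward(s, a_h, a_r, theta)
        history.append((s, a_h, a_r))
        s = game.transition(s, a_h, a_r)
    return total
```
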
Cassidy Laidlaw (@cassidy_laidlaw):

Our new RL algorithm, AssistanceZero, trains an assistant that displays emergent helpful behaviors like *active learning* and *learning from corrections*.

Anca Dragan (@ancadianadragan):

Per our Frontier Safety Framework, we continue to test our models for critical capabilities. Here's the updated model card for Gemini 2.5 Pro with frontier safety evaluations, plus an explanation of how our safety-buffer / alert-threshold approach applies to 2.0, 2.5, and what's coming.