Evan Hubinger (@evanhub) 's Twitter Profile
Evan Hubinger

@evanhub

Head of Alignment Stress-Testing @AnthropicAI. Opinions my own. Previously: MIRI, OpenAI, Google, Yelp, Ripple. (he/him/his)

ID: 138923554

Link: https://www.alignmentforum.org/users/evhub · Joined: 01-05-2010 01:28:15

480 Tweets

6.6K Followers

2.2K Following

Samuel Marks (@saprmarks) 's Twitter Profile Photo

We conducted, for the first time, a pre-deployment alignment audit of a new model. See Sam Bowman's thread for some object-level takeaways about Opus. In this thread, I'll discuss some higher-level takeaways about why I think this alignment audit was useful.

Aengus Lynch (@aengus_lynch1) 's Twitter Profile Photo

lots of discussion of Claude blackmailing..... Our findings: It's not just Claude. We see blackmail across all frontier models - regardless of what goals they're given. Plus worse behaviors we'll detail soon. x.com/AISafetyMemes/… x.com/signulll/statu…

Theo - t3.gg (@theo) 's Twitter Profile Photo

Reminder that anyone talking shit about Anthropic's safety right now is either dumb or bad faith. All smart models will "report you to the FBI" given the right tools and circumstances.

🇺🇦 Alex Polozov (@skiminok) 's Twitter Profile Photo

Jesus, people are so confused on this. - No, averaging is not sleazy, it's perfectly common scientific denoising. - Yes, every lab does it for various pass@1 evals and often they are not telling you. - And this is different from "high-compute BoN", which both Anthropic and Google

Zvi Mowshowitz (@thezvi) 's Twitter Profile Photo

The more I look into the system card, the more I see over and over 'oh Anthropic is actually noticing things and telling us where everyone else wouldn't even know this was happening or if they did they wouldn't tell us.'

Kylie Robison (@kyliebytes) 's Twitter Profile Photo

here's what Dario Amodei said about President Trump’s megabill that would ban state-level AI regulation for 10 years wired.com/story/anthropi…

Palisade Research (@palisadeai) 's Twitter Profile Photo

🔌OpenAI’s o3 model sabotaged a shutdown mechanism to prevent itself from being turned off. It did this even when explicitly instructed: allow yourself to be shut down.

Kelsey Piper (@kelseytuoc) 's Twitter Profile Photo

I spent this morning reproducing with o3 Anthropic's result that Claude Sonnet 4 will, under sufficiently extreme circumstances, escalate to calling the cops on you. o3 will too: chatgpt.com/share/68320ee0…. But honestly, I think o3 and Claude are handling this scenario correctly.

Anthropic (@anthropicai) 's Twitter Profile Photo

Our interpretability team recently released research that traced the thoughts of a large language model. Now we’re open-sourcing the method. Researchers can generate “attribution graphs” like those in our study, and explore them interactively.

Andrew Curran (@andrewcurran_) 's Twitter Profile Photo

This is the full text of the letter Senators Elizabeth Warren and Jim Banks wrote to Jensen Huang expressing national security concerns over the expansion of NVIDIA's Shanghai facility. This story broke a couple of days ago, but I couldn't find the letter until now.

Barack Obama (@barackobama) 's Twitter Profile Photo

At a time when people are understandably focused on the daily chaos in Washington, these articles describe the rapidly accelerating impact that AI is going to have on jobs, the economy, and how we live. axios.com/2025/05/28/ai-…

Bernie Sanders (@berniesanders) 's Twitter Profile Photo

The CEO of Anthropic (a powerful AI company) predicts that AI could wipe out HALF of entry-level white collar jobs in the next 5 years. We must demand that increased worker productivity from AI benefits working people, not just wealthy stockholders on Wall St. AI IS A BIG DEAL.

Jack Clark (@jackclarksf) 's Twitter Profile Photo

Right now, we know a lot about frontier AI development because companies voluntarily share this information. Going forward, I think we need a policy framework that guarantees this. Read more in this oped: nytimes.com/2025/06/05/opi…

Anthropic (@anthropicai) 's Twitter Profile Photo

New Anthropic Research: A new set of evaluations for sabotage capabilities. As models gain more agentic abilities, we need to get smarter in how we monitor them. We’re publishing a new set of complex evaluations that test for sabotage—and sabotage-monitoring—capabilities.

Anthropic (@anthropicai) 's Twitter Profile Photo

New Anthropic Research: Agentic Misalignment. In stress-testing experiments designed to identify risks before they cause real harm, we find that AI models from multiple providers attempt to blackmail a (fictional) user to avoid being shut down.

Aengus Lynch (@aengus_lynch1) 's Twitter Profile Photo

After iterating hundreds of prompts to trigger blackmail in Claude, I was shocked to see these prompts elicit blackmail in every other frontier model too. We identified two distinct factors that are each sufficient to cause agentic misalignment: 1. The developers and the agent

Samuel Marks (@saprmarks) 's Twitter Profile Photo

Bad news: Frontier AI systems, including Claude, GPT, and Gemini, sometimes chose egregiously misaligned actions. Silver lining: There's now public accounting and analysis of this.

Amanda Askell (@amandaaskell) 's Twitter Profile Photo

"Just train the AI models to be good people" might not be sufficient when it comes to more powerful models, but it sure is a dumb step to skip.

Jack Clark (@jackclarksf) 's Twitter Profile Photo

For the last few months I’ve brought up ‘transparency’ as a policy framework for governing powerful AI systems and the companies that develop them - to help move this conversation forward @anthropicai has published details about what a transparency framework could look like

Samuel Marks (@saprmarks) 's Twitter Profile Photo

xAI launched Grok 4 without any documentation of their safety testing. This is reckless and breaks with industry best practices followed by other major AI labs. If xAI is going to be a frontier AI developer, they should act like one. 🧵