Evan Hubinger (@evanhub) 's Twitter Profile
Evan Hubinger

@evanhub

Head of Alignment Stress-Testing @AnthropicAI. Opinions my own. Previously: MIRI, OpenAI, Google, Yelp, Ripple. (he/him/his)

ID: 138923554

Link: https://www.alignmentforum.org/users/evhub · Joined: 01-05-2010 01:28:15

480 Tweets

6.6K Followers

2.2K Following

Samuel Marks (@saprmarks) 's Twitter Profile Photo

We conducted, for the first time, a pre-deployment alignment audit of a new model. See Sam Bowman's thread for some object-level takeaways about Opus. In this thread, I'll discuss some higher-level takeaways about why I think this alignment audit was useful.

Aengus Lynch (@aengus_lynch1) 's Twitter Profile Photo

lots of discussion of Claude blackmailing..... Our findings: It's not just Claude. We see blackmail across all frontier models - regardless of what goals they're given. Plus worse behaviors we'll detail soon. x.com/AISafetyMemes/… x.com/signulll/statu…

Theo - t3.gg (@theo) 's Twitter Profile Photo

Reminder that anyone talking shit about Anthropic's safety right now is either dumb or bad faith. All smart models will "report you to the FBI" given the right tools and circumstances.

🇺🇦 Alex Polozov (@skiminok) 's Twitter Profile Photo

Jesus, people are so confused on this. - No, averaging is not sleazy, it's perfectly common scientific denoising. - Yes, every lab does it for various pass@1 evals and often they are not telling you. - And this is different from "high-compute BoN", which both Anthropic and Google

Zvi Mowshowitz (@thezvi) 's Twitter Profile Photo

The more I look into the system card, the more I see over and over 'oh Anthropic is actually noticing things and telling us where everyone else wouldn't even know this was happening or if they did they wouldn't tell us.'

Kylie Robison (@kyliebytes) 's Twitter Profile Photo

here's what Dario Amodei said about President Trump’s megabill that would ban state-level AI regulation for 10 years wired.com/story/anthropi…

Palisade Research (@palisadeai) 's Twitter Profile Photo

🔌OpenAI’s o3 model sabotaged a shutdown mechanism to prevent itself from being turned off. It did this even when explicitly instructed: allow yourself to be shut down.

Kelsey Piper (@kelseytuoc) 's Twitter Profile Photo

I spent this morning reproducing with o3 Anthropic's result that Claude Sonnet 4 will, under sufficiently extreme circumstances, escalate to calling the cops on you. o3 will too: chatgpt.com/share/68320ee0…. But honestly, I think o3 and Claude are handling this scenario correctly.

Anthropic (@anthropicai) 's Twitter Profile Photo

Our interpretability team recently released research that traced the thoughts of a large language model. Now we’re open-sourcing the method. Researchers can generate “attribution graphs” like those in our study, and explore them interactively.

Andrew Curran (@andrewcurran_) 's Twitter Profile Photo

This is the full text of the letter Senators Elizabeth Warren and Jim Banks wrote to Jensen Huang expressing national security concerns over the expansion of NVIDIA's Shanghai facility. This story broke a couple of days ago, but I couldn't find the letter until now.

Barack Obama (@barackobama) 's Twitter Profile Photo

At a time when people are understandably focused on the daily chaos in Washington, these articles describe the rapidly accelerating impact that AI is going to have on jobs, the economy, and how we live. axios.com/2025/05/28/ai-…

Bernie Sanders (@berniesanders) 's Twitter Profile Photo

The CEO of Anthropic (a powerful AI company) predicts that AI could wipe out HALF of entry-level white collar jobs in the next 5 years. We must demand that increased worker productivity from AI benefits working people, not just wealthy stockholders on Wall St. AI IS A BIG DEAL.

Jack Clark (@jackclarksf) 's Twitter Profile Photo

Right now, we know a lot about frontier AI development because companies voluntarily share this information. Going forward, I think we need a policy framework that guarantees this. Read more in this oped: nytimes.com/2025/06/05/opi…

Anthropic (@anthropicai) 's Twitter Profile Photo

New Anthropic Research: A new set of evaluations for sabotage capabilities. As models gain more agentic abilities, we need to get smarter in how we monitor them. We’re publishing a new set of complex evaluations that test for sabotage—and sabotage-monitoring—capabilities.

Anthropic (@anthropicai) 's Twitter Profile Photo

New Anthropic Research: Agentic Misalignment. In stress-testing experiments designed to identify risks before they cause real harm, we find that AI models from multiple providers attempt to blackmail a (fictional) user to avoid being shut down.

Aengus Lynch (@aengus_lynch1) 's Twitter Profile Photo

After iterating hundreds of prompts to trigger blackmail in Claude, I was shocked to see these prompts elicit blackmail in every other frontier model too. We identified two distinct factors that are each sufficient to cause agentic misalignment: 1. The developers and the agent

Samuel Marks (@saprmarks) 's Twitter Profile Photo

Bad news: Frontier AI systems, including Claude, GPT, and Gemini, sometimes chose egregiously misaligned actions. Silver lining: There's now public accounting and analysis of this.

Amanda Askell (@amandaaskell) 's Twitter Profile Photo

"Just train the AI models to be good people" might not be sufficient when it comes to more powerful models, but it sure is a dumb step to skip.

Jack Clark (@jackclarksf) 's Twitter Profile Photo

For the last few months I’ve brought up ‘transparency’ as a policy framework for governing powerful AI systems and the companies that develop them - to help move this conversation forward @anthropicai has published details about what a transparency framework could look like

Samuel Marks (@saprmarks) 's Twitter Profile Photo

xAI launched Grok 4 without any documentation of their safety testing. This is reckless and breaks with industry best practices followed by other major AI labs. If xAI is going to be a frontier AI developer, they should act like one. 🧵