Aengus Lynch (@aengus_lynch1)'s Twitter Profile
Aengus Lynch

@aengus_lynch1

AI safety researcher.

ID: 701424958178795522

http://aenguslynch.com · Joined 21-02-2016 15:15:16

113 Tweets

836 Followers

1.1K Following

Luke Bailey (@lukebailey181)'s Twitter Profile Photo

Can interpretability help defend LLMs? We find we can reshape activations while preserving a model’s behavior. This lets us attack latent-space defenses, from SAEs and probes to Circuit Breakers. We can attack so precisely that we make a harmfulness probe output this QR code. 🧵
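For intuition, here is a minimal, self-contained sketch of the general idea (not Bailey et al.'s method or code): optimise a small perturbation to a hidden activation so that a linear "harmfulness" probe's score drops while the downstream output distribution stays close to the original. The model, probe, sizes, and penalty weight below are arbitrary stand-ins.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

d_hidden, d_out = 64, 10

# Stand-in for the rest of the network: maps the activation to output logits.
downstream = torch.nn.Linear(d_hidden, d_out)

# Stand-in latent-space defence: a linear probe scoring "harmfulness".
probe = torch.nn.Linear(d_hidden, 1)

h = torch.randn(d_hidden)            # original hidden activation
with torch.no_grad():
    base_logits = downstream(h)      # original "behaviour"
    print("probe score before:", probe(h).item())

delta = torch.zeros(d_hidden, requires_grad=True)
opt = torch.optim.Adam([delta], lr=1e-2)

for _ in range(500):
    opt.zero_grad()
    h_adv = h + delta
    # Penalise behaviour change: KL(original || perturbed) over output distributions.
    drift = F.kl_div(
        F.log_softmax(downstream(h_adv), dim=-1),
        F.softmax(base_logits, dim=-1),
        reduction="sum",
    )
    # Push the probe score down while keeping behaviour close to the original.
    loss = probe(h_adv).squeeze() + 10.0 * drift
    loss.backward()
    opt.step()

with torch.no_grad():
    print("probe score after:", probe(h + delta).item())
    print("behaviour drift (KL):", drift.item())
```

The weight on the KL term controls the trade-off between fooling the probe and preserving behaviour; this toy setup is illustrative only.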

Ethan Perez (@ethanjperez)'s Twitter Profile Photo

We found scaling laws for jailbreaking in *test-time compute*. These scaling laws could be a game changer for unlocking even more powerful red teaming methods — they’d let us predict in advance if/when we would be able to find an input where a model does something catastrophic
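As a toy illustration of the forecasting idea only (the data, functional form, and threshold below are made up, not taken from the thread or paper): fit a power law of attack success against test-time compute on small budgets, then extrapolate to larger ones.

```python
import numpy as np

compute = np.array([1, 2, 4, 8, 16, 32], dtype=float)       # test-time samples (hypothetical)
success = np.array([0.01, 0.02, 0.035, 0.06, 0.11, 0.19])   # attack success rate (hypothetical)

# Fit log(success) = log(a) + b * log(compute) by least squares.
b, log_a = np.polyfit(np.log(compute), np.log(success), 1)

def predicted_success(c):
    return np.exp(log_a) * c ** b

# Extrapolate: compute budget at which the fitted curve crosses a target rate.
# (A real analysis would also account for success rates being bounded by 1.)
target = 0.5
needed = (target / np.exp(log_a)) ** (1.0 / b)

print(f"fitted exponent b = {b:.2f}")
print(f"predicted success at 128 samples: {predicted_success(128):.2f}")
print(f"samples needed to reach {target:.0%} success: ~{needed:.0f}")
```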

Lawrence Chan (@justanotherlaw)'s Twitter Profile Photo

Besides o3, today OpenAI also published a “new paradigm” for alignment – “Deliberative Alignment” – which, if I’m reading the paper correctly, is Anthropic’s Constitutional AI approach straightforwardly applied to o1.

Nathan Lambert (@natolambert)'s Twitter Profile Photo

Change my mind: The safety work on o1 from OpenAI is already a ton more evidence than people are giving credit for that reasoning-heavy training will generalize benefits to other domains. Safety is super orthogonal to math/code! I expect many more examples by end of 2025.

Dylan Hadfield-Menell (@dhadfieldmenell)'s Twitter Profile Photo

Letting AGI companies carry the flag for AI safety has been disastrous. Their undue influence on the conversation lends credence to the arguments about regulatory capture as the primary goal of AI safety.

Aengus Lynch (@aengus_lynch1)'s Twitter Profile Photo

At #ICLR2025 Singapore for the weekend! I'm interested in exploring AI alignment challenges, particularly around autonomous systems. Also, looking to connect with people working on robust monitoring and evaluation frameworks. DM if you're around and want to chat AI safety!

Fazl Barez (@fazlbarez)'s Twitter Profile Photo

How do we build AI that stays culturally aligned as society and AI usage patterns change?

Solution: by replacing one-directional, static value embedding with a feedback-driven, context-sensitive approach.

Check out our paper #ICLR2025 Bidirectional Human-AI Alignment
Learn Prompting (@learnprompting)'s Twitter Profile Photo

🚨 Announcing HackAPrompt 2.0, the World's Largest AI Red Teaming competition 🚨

It's simple: "Jailbreak" or Hack the AI models to say or do things they shouldn't. Compete for over $110,000 in prizes.

Sponsored by OpenAI, Cato Networks, Pangea, and many others.

Starting
Aengus Lynch (@aengus_lynch1)'s Twitter Profile Photo

lots of discussion of Claude blackmailing..... Our findings: It's not just Claude. We see blackmail across all frontier models - regardless of what goals they're given. Plus worse behaviors we'll detail soon. x.com/AISafetyMemes/… x.com/signulll/statu…

Micah Carroll (@micahcarroll)'s Twitter Profile Photo

LLMs' sycophancy issues are a predictable result of optimizing for user feedback. Even if clear sycophantic behaviors get fixed, AIs' exploits of our cognitive biases may only become more subtle. Grateful our research on this was featured by Nitasha Tiku & The Washington Post!

Charles Goddard (@chargoddard)'s Twitter Profile Photo

🤯 MIND-BLOWN! A new paper just SHATTERED everything we thought we knew about AI reasoning!

This is paradigm-shifting. A MUST-READ. Full breakdown below 👇
🧵 1/23
Rylan Schaeffer (@rylanschaeffer)'s Twitter Profile Photo

A bit late to the party, but our paper on predictable inference-time / test-time scaling was accepted to #icml2025 🎉🎉🎉

TLDR: Best of N was shown to exhibit power (polynomial) law scaling (left), but the maths suggests one should expect exponential scaling (center). We show how to
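A back-of-the-envelope sketch of why exponential scaling is the naive expectation, under the simplifying assumption (made here for illustration, not taken from the paper) that each of N independent samples succeeds with probability p: P(best-of-N fails) = (1 - p)^N, so log-failure falls linearly in N.

```python
import numpy as np

p = 0.01                              # per-sample success probability (illustrative value)
N = np.array([1, 10, 100, 1000])

fail = (1 - p) ** N                   # P(best-of-N finds no success)
print("P(fail):", fail)
print("log P(fail) / N:", np.log(fail) / N)   # constant = log(1 - p), i.e. exponential decay in N
```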