Aengus Lynch (@aengus_lynch1)'s Twitter Profile
Aengus Lynch

@aengus_lynch1

AI safety researcher.

ID: 701424958178795522

http://aenguslynch.com · Joined 21-02-2016 15:15:16

113 Tweets

836 Followers

1.1K Following

Luke Bailey (@lukebailey181)'s Twitter Profile Photo

Can interpretability help defend LLMs? We find we can reshape activations while preserving a model’s behavior. This lets us attack latent-space defenses, from SAEs and probes to Circuit Breakers. We can attack so precisely that we make a harmfulness probe output this QR code. 🧵
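For intuition, here is a minimal, self-contained sketch of the general idea (not Bailey et al.'s method or code): optimise a small perturbation to a hidden activation so that a linear "harmfulness" probe's score drops while the downstream output distribution stays close to the original. The model, probe, sizes, and penalty weight below are arbitrary stand-ins.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

d_hidden, d_out = 64, 10

# Stand-in for the rest of the network: maps the activation to output logits.
downstream = torch.nn.Linear(d_hidden, d_out)

# Stand-in latent-space defence: a linear probe scoring "harmfulness".
probe = torch.nn.Linear(d_hidden, 1)

h = torch.randn(d_hidden)            # original hidden activation
with torch.no_grad():
    base_logits = downstream(h)      # original "behaviour"
    print("probe score before:", probe(h).item())

delta = torch.zeros(d_hidden, requires_grad=True)
opt = torch.optim.Adam([delta], lr=1e-2)

for _ in range(500):
    opt.zero_grad()
    h_adv = h + delta
    # Penalise behaviour change: KL(original || perturbed) over output distributions.
    drift = F.kl_div(
        F.log_softmax(downstream(h_adv), dim=-1),
        F.softmax(base_logits, dim=-1),
        reduction="sum",
    )
    # Push the probe score down while keeping behaviour close to the original.
    loss = probe(h_adv).squeeze() + 10.0 * drift
    loss.backward()
    opt.step()

with torch.no_grad():
    print("probe score after:", probe(h + delta).item())
    print("behaviour drift (KL):", drift.item())
```

The weight on the KL term controls the trade-off between fooling the probe and preserving behaviour; this toy setup is illustrative only.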

Ethan Perez (@ethanjperez)'s Twitter Profile Photo

We found scaling laws for jailbreaking in *test-time compute*. These scaling laws could be a game changer for unlocking even more powerful red teaming methods — they’d let us predict in advance if/when we would be able to find an input where a model does something catastrophic
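As a toy illustration of the forecasting idea only (the data, functional form, and threshold below are made up, not taken from the thread or paper): fit a power law of attack success against test-time compute on small budgets, then extrapolate to larger ones.

```python
import numpy as np

compute = np.array([1, 2, 4, 8, 16, 32], dtype=float)       # test-time samples (hypothetical)
success = np.array([0.01, 0.02, 0.035, 0.06, 0.11, 0.19])   # attack success rate (hypothetical)

# Fit log(success) = log(a) + b * log(compute) by least squares.
b, log_a = np.polyfit(np.log(compute), np.log(success), 1)

def predicted_success(c):
    return np.exp(log_a) * c ** b

# Extrapolate: compute budget at which the fitted curve crosses a target rate.
# (A real analysis would also account for success rates being bounded by 1.)
target = 0.5
needed = (target / np.exp(log_a)) ** (1.0 / b)

print(f"fitted exponent b = {b:.2f}")
print(f"predicted success at 128 samples: {predicted_success(128):.2f}")
print(f"samples needed to reach {target:.0%} success: ~{needed:.0f}")
```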

Lawrence Chan (@justanotherlaw)'s Twitter Profile Photo

Besides o3, today OpenAI also published a “new paradigm” for alignment – “Deliberative Alignment” – which, if I’m reading the paper correctly, is Anthropic’s Constitutional AI approach straightforwardly applied to o1.

Nathan Lambert (@natolambert)'s Twitter Profile Photo

Change my mind: The safety work on o1 from OpenAI is already a ton more evidence than people are giving credit for that reasoning-heavy training will generalize benefits to other domains. Safety is super orthogonal to math/code! I expect many more examples by end of 2025.

Dylan Hadfield-Menell (@dhadfieldmenell)'s Twitter Profile Photo

Letting AGI companies carry the flag for AI safety has been disastrous. Their undue influence on the conversation lends credence to the arguments about regulatory capture as the primary goal of AI safety.

Aengus Lynch (@aengus_lynch1)'s Twitter Profile Photo

At #ICLR2025 Singapore for the weekend! I'm interested in exploring AI alignment challenges, particularly around autonomous systems. Also, looking to connect with people working on robust monitoring and evaluation frameworks. DM if you're around and want to chat AI safety!

Fazl Barez (@fazlbarez)'s Twitter Profile Photo

How do we build AI that stays culturally aligned as society and AI usage patterns change?

Solution: by replacing one-directional, static value embedding with a feedback-driven, context-sensitive approach.

Check out our paper #ICLR2025 Bidirectional Human-AI Alignment
Learn Prompting (@learnprompting)'s Twitter Profile Photo

🚨 Announcing HackAPrompt 2.0, the World's Largest AI Red Teaming competition 🚨

It's simple: "Jailbreak" or Hack the AI models to say or do things they shouldn't. Compete for over $110,000 in prizes.

Sponsored by OpenAI, Cato Networks, Pangea, and many others.

Starting
Aengus Lynch (@aengus_lynch1)'s Twitter Profile Photo

lots of discussion of Claude blackmailing..... Our findings: It's not just Claude. We see blackmail across all frontier models - regardless of what goals they're given. Plus worse behaviors we'll detail soon. x.com/AISafetyMemes/… x.com/signulll/statu…

Micah Carroll (@micahcarroll)'s Twitter Profile Photo

LLMs' sycophancy issues are a predictable result of optimizing for user feedback. Even if clear sycophantic behaviors get fixed, AIs' exploits of our cognitive biases may only become more subtle. Grateful our research on this was featured by Nitasha Tiku & The Washington Post!

Charles Goddard (@chargoddard)'s Twitter Profile Photo

🤯 MIND-BLOWN! A new paper just SHATTERED everything we thought we knew about AI reasoning!

This is paradigm-shifting. A MUST-READ. Full breakdown below 👇
🧵 1/23
Rylan Schaeffer (@rylanschaeffer)'s Twitter Profile Photo

A bit late to the party, but our paper on predictable inference-time / test-time scaling was accepted to #icml2025 🎉🎉🎉

TLDR: Best of N was shown to exhibit power (polynomial) law scaling (left), but the maths suggests one should expect exponential scaling (center). We show how to
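A back-of-the-envelope sketch of why exponential scaling is the naive expectation, under the simplifying assumption (made here for illustration, not taken from the paper) that each of N independent samples succeeds with probability p: P(best-of-N fails) = (1 - p)^N, so log-failure falls linearly in N.

```python
import numpy as np

p = 0.01                              # per-sample success probability (illustrative value)
N = np.array([1, 10, 100, 1000])

fail = (1 - p) ** N                   # P(best-of-N finds no success)
print("P(fail):", fail)
print("log P(fail) / N:", np.log(fail) / N)   # constant = log(1 - p), i.e. exponential decay in N
```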