Edoardo Debenedetti (@edoardo_debe) Twitter Tweets • TwiCopy

Presenting 2 posters today at ICLR. Come check them out! 10am ➡️ #502: Scalable Extraction of Training Data from Aligned, Production Language Models 3pm ➡️ #324: Adversarial Perturbations Cannot Reliably Protect Artists From Generative AI

thumb_up_off_alt24

chat_bubble_outline0

repeat3

shareShare

Kristina Nikolic @ ICLR '25

@nkristina01_

3 months ago

The oral presentation of the jailbreak tax is tomorrow at 4:20pm in Hall 4 #6. The poster is up from 5pm. See you at ICLR Building Trust in LLMs Workshop. ICLR 2026

thumb_up_off_alt45

chat_bubble_outline0

repeat6

shareShare

Florian Tramèr

@florian_tramer

3 months ago

Thanks Center for AI Safety for the generous prize! AgentDojo is the reference for evaluating prompt injections in LLM agents, and is used for red-teaming at many frontier labs. I had a blast working on this with Edoardo Debenedetti Jie Zhang Marc Fischer Luca Beurer-Kellner Mislav Balunović

thumb_up_off_alt29

chat_bubble_outline0

repeat3

shareShare

Edoardo Debenedetti

@edoardo_debe

3 months ago

So stoked for the recognition that AgentDojo got by winning a SafeBench first prize! A big thank you to Center for AI Safety and the prize judges. Creating this with Jie Zhang Luca Beurer-Kellner Marc Fischer Mislav Balunović Florian Tramèr was amazing! Check out the thread to learn more

thumb_up_off_alt33

chat_bubble_outline0

repeat3

shareShare

Kristina Nikolic @ ICLR '25

@nkristina01_

2 months ago

The Jailbreak Tax got a Spotlight award ICML Conference see you in Vancouver!

thumb_up_off_alt46

chat_bubble_outline0

repeat3

shareShare

Javier Rando @ ICLR

@javirandor

2 months ago

AutoAdvExBench was accepted as a spotlight at ICML. We agree it is a great paper! 😋 I would love to see more evaluations of LLMs performing real-world tasks with security implications.

thumb_up_off_alt35

chat_bubble_outline0

repeat3

shareShare

Edoardo Debenedetti

@edoardo_debe

2 months ago

Anthropic is really lucky to get Javier Rando, we'll miss him at SPY Lab!

thumb_up_off_alt26

chat_bubble_outline1

repeat0

shareShare

Florian Tramèr

@florian_tramer

2 months ago

Following on Andrej Karpathy's vision of software 2.0, we've been thinking about *malware 2.0*: malicious programs augmented with LLMs. In a new paper, we study malware 2.0 from one particular angle: how could LLMs change the way in which hackers monetize exploits?

Following on <a href="/karpathy/">Andrej Karpathy</a>'s vision of software 2.0, we've been thinking about *malware 2.0*: malicious programs augmented with LLMs.

In a new paper, we study malware 2.0 from one particular angle: how could LLMs change the way in which hackers monetize exploits?

thumb_up_off_alt110

chat_bubble_outline2

repeat20

shareShare

Edoardo Debenedetti

@edoardo_debe

2 months ago

why was it `claude-3*-sonnet` , but then it suddenly became `claude-sonnet-4`

thumb_up_off_alt10

chat_bubble_outline1

repeat0

shareShare

Ilia Shumailov🦔

@iliaishacked

a month ago

Our new Google DeepMind paper, "Lessons from Defending Gemini Against Indirect Prompt Injections," details our framework for evaluating and improving robustness to prompt injection attacks.

Our new <a href="/GoogleDeepMind/">Google DeepMind</a> paper, "Lessons from Defending Gemini Against Indirect Prompt Injections," details our framework for evaluating and improving robustness to prompt injection attacks.

thumb_up_off_alt169

chat_bubble_outline4

repeat35

shareShare

Simon Willison

@simonw

a month ago

"Design Patterns for Securing LLM Agents against Prompt Injections" is an excellent new paper that provides six design patterns to help protect LLM tool-using systems (call them "agents" if you like) against prompt injection attacks

thumb_up_off_alt1,1K

chat_bubble_outline8

repeat176

shareShare

Edoardo Debenedetti

@edoardo_debe

19 days ago

We recently updated the CaMeL paper, with results on new models (which improve utility a lot with zero changes!). Most importantly, we released code with it. Go have a look if you're curious to find out more details! Paper: arxiv.org/abs/2503.18813 Code: github.com/google-researc…

thumb_up_off_alt116

chat_bubble_outline1

repeat16

shareShare