Kellin Pelrine (@kellinpelrine)'s Twitter Profile
Kellin Pelrine

@kellinpelrine

ID: 1272590659954683904

Joined: 15-06-2020 18:03:58

26 Tweets

42 Followers

9 Following

Complex Data Lab McGill (@complexdatalab):

1/5 AI is increasingly, even superhumanly, persuasive… could it soon cause severe harm through societal-scale manipulation? It’s extremely hard to test countermeasures, since we can’t just go out and manipulate people to see how countermeasures work. What can we do? 🧵

FAR.AI (@farairesearch):

A tiny dose of poisoned data can cause big problems for AI. Our jailbreak-tuning method causes GPT-4o to capably answer virtually any harmful question. And this may get worse: after testing 23 LLMs from 8 model series, we find that larger LLMs are more vulnerable to poisoning.

Impact Academy (@aisafetyfellows):

🔊Advance AI Safety Research & Development: Apply for Global AI Safety Fellowship 2025 🧵

🌟What: The Fellowship is a 3-6 month fully-funded research program for exceptional STEM talent worldwide. (1/10)
...

FAR.AI (@farairesearch):

1/ Safety guardrails are illusory. DeepSeek R1’s advanced reasoning can be converted into an "evil twin": just as powerful, but with safety guardrails stripped away. The same applies to GPT-4o, Gemini 1.5 & Claude 3. How can we ensure AI maximizes benefits while minimizing harm?

Adam Gleave (@argleave):

My colleague Ian McKenzie spent six hours red-teaming Claude 4 Opus, and easily bypassed safeguards designed to block WMD development. Claude gave >15 pages of non-redundant instructions for sarin gas, describing all key steps in the manufacturing process.

Complex Data Lab McGill (@complexdatalab):

We’re pleased to announce that our workshop SocialSim’25: Social Simulations with LLMs has been accepted at COLM 2025, taking place in Montreal on October 10.

Adam Gleave (@argleave):

As I say in the video, innovation vs. safety is a false dichotomy: do check out the many great ideas for how innovation can enable effective policy in the video 👇 and the initial talk recordings!

FAR.AI (@farairesearch):

🤔 Can lie detectors make AI more honest? Or will they become sneakier liars? We tested what happens when you add deception detectors into the training loop of large language models. Will training against probe-detected lies encourage honesty? Depends on how you train it!
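
Below is a rough, hypothetical sketch of the kind of setup the tweet describes, not FAR.AI's actual method: a frozen linear "deception probe" scores the model's hidden states during training, and that score is added to the language-modeling loss as a penalty. The tiny GRU language model, the probe, and the penalty weight LAMBDA are stand-ins chosen only for illustration.

```python
# Hedged sketch (assumed setup, not FAR.AI's code): train a toy language model
# while penalizing hidden states that a frozen "deception probe" flags as lies.
import torch
import torch.nn as nn

VOCAB, DIM = 1000, 64

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.rnn = nn.GRU(DIM, DIM, batch_first=True)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):
        hidden, _ = self.rnn(self.embed(tokens))   # per-token hidden states
        return self.head(hidden), hidden

model = TinyLM()
probe = nn.Linear(DIM, 1)                          # stand-in for a pretrained deception probe
for p in probe.parameters():                       # keep the probe frozen
    p.requires_grad_(False)

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
lm_loss_fn = nn.CrossEntropyLoss()
LAMBDA = 0.1                                       # weight on the honesty penalty

tokens = torch.randint(0, VOCAB, (8, 32))          # stand-in training batch
for step in range(100):
    logits, hidden = model(tokens[:, :-1])
    lm_loss = lm_loss_fn(logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))
    # The probe assigns each token a "deception score"; penalizing it pushes the
    # model away from internal states the probe reads as deceptive.
    deception_score = torch.sigmoid(probe(hidden)).mean()
    loss = lm_loss + LAMBDA * deception_score
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Presumably, choices like the penalty weight and whether the probe stays frozen are where the "depends on how you train it" caveat bites: too much pressure against a fixed detector can teach the model to evade the probe rather than to be honest.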

Complex Data Lab McGill (@complexdatalab):

💡 Strong data and eval are essential for real-world progress. In "A Guide to Misinformation Detection Data and Evaluation"—to be presented at KDD 2025—we conduct the largest survey to date in this domain: 75 datasets curated, 45 accessible ones analyzed in depth. Key findings👇

FAR.AI (@farairesearch):

1/ "Swiss cheese security", stacking layers of imperfect defenses, is a key part of AI companies' plans to safeguard models, and is used to secure Anthropic's Opus 4 model. Our new STACK attack breaks each layer in turn, highlighting this approach may be less secure than hoped.

FAR.AI (@farairesearch):

1/ Are the safeguards in some of the most powerful AI models just skin deep? Our research on Jailbreak-Tuning reveals how any fine-tunable model can be turned into its "evil twin"—equally capable as the original but stripped of all safety measures.

Keren Gu 🌱👩🏻‍💻 (@kerengu):

We ran 1000s of hours of red-teaming with global experts, including biology PhDs and jailbreakers. We worked with UK AISI, our Red Teaming Network and FAR.ai to harden our defenses.

FAR.AI (@farairesearch):

We worked with OpenAI to test GPT-5 and improve its safeguards. We applaud OpenAI's free sharing of 3rd-party testing and responsiveness to feedback. However, our testing uncovered key limitations with the safeguards and threat modeling, which we hope OpenAI will soon resolve.
