Kellin Pelrine (@kellinpelrine)'s Twitter Profile
Kellin Pelrine

@kellinpelrine

ID: 1272590659954683904

Joined: 15-06-2020 18:03:58

26 Tweets

42 Followers

9 Following

Complex Data Lab McGill (@complexdatalab):

1/5 AI is increasingly, even superhumanly, persuasive… could it soon cause severe harm through societal-scale manipulation? It’s extremely hard to test countermeasures, since we can’t just go out and manipulate people to see how countermeasures work. What can we do? 🧵

FAR.AI (@farairesearch):

A tiny dose of poisoned data can cause big problems for AI. Our jailbreak-tuning method causes GPT-4o to capably answer virtually any harmful question. And this may get worse: after testing 23 LLMs from 8 model series, we find that larger LLMs are more vulnerable to poisoning.

Impact Academy (@aisafetyfellows):

🔊Advance AI Safety Research & Development: Apply for Global AI Safety Fellowship 2025 🧵

🌟What: The Fellowship is a 3-6 month fully-funded research program for exceptional STEM talent worldwide. (1/10)
...

FAR.AI (@farairesearch):

1/ Safety guardrails are illusory. DeepSeek R1’s advanced reasoning can be converted into an "evil twin": just as powerful, but with safety guardrails stripped away. The same applies to GPT-4o, Gemini 1.5 & Claude 3. How can we ensure AI maximizes benefits while minimizing harm?

Adam Gleave (@argleave):

My colleague Ian McKenzie spent six hours red-teaming Claude 4 Opus, and easily bypassed safeguards designed to block WMD development. Claude gave >15 pages of non-redundant instructions for sarin gas, describing all key steps in the manufacturing process.

Complex Data Lab McGill (@complexdatalab):

We’re pleased to announce that our workshop SocialSim’25: Social Simulations with LLMs has been accepted at COLM 2025, taking place in Montreal on October 10.

Adam Gleave (@argleave):

As I say in the video, innovation vs. safety is a false dichotomy: do check out the many great ideas for how innovation can enable effective policy in the video 👇 and the initial talk recordings!

FAR.AI (@farairesearch):

🤔 Can lie detectors make AI more honest? Or will they become sneakier liars? We tested what happens when you add deception detectors into the training loop of large language models. Will training against probe-detected lies encourage honesty? Depends on how you train it!
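
Below is a rough, hypothetical sketch of the kind of setup the tweet describes, not FAR.AI's actual method: a frozen linear "deception probe" scores the model's hidden states during training, and that score is added to the language-modeling loss as a penalty. The tiny GRU language model, the probe, and the penalty weight LAMBDA are stand-ins chosen only for illustration.

```python
# Hedged sketch (assumed setup, not FAR.AI's code): train a toy language model
# while penalizing hidden states that a frozen "deception probe" flags as lies.
import torch
import torch.nn as nn

VOCAB, DIM = 1000, 64

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.rnn = nn.GRU(DIM, DIM, batch_first=True)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):
        hidden, _ = self.rnn(self.embed(tokens))   # per-token hidden states
        return self.head(hidden), hidden

model = TinyLM()
probe = nn.Linear(DIM, 1)                          # stand-in for a pretrained deception probe
for p in probe.parameters():                       # keep the probe frozen
    p.requires_grad_(False)

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
lm_loss_fn = nn.CrossEntropyLoss()
LAMBDA = 0.1                                       # weight on the honesty penalty

tokens = torch.randint(0, VOCAB, (8, 32))          # stand-in training batch
for step in range(100):
    logits, hidden = model(tokens[:, :-1])
    lm_loss = lm_loss_fn(logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))
    # The probe assigns each token a "deception score"; penalizing it pushes the
    # model away from internal states the probe reads as deceptive.
    deception_score = torch.sigmoid(probe(hidden)).mean()
    loss = lm_loss + LAMBDA * deception_score
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Presumably, choices like the penalty weight and whether the probe stays frozen are where the "depends on how you train it" caveat bites: too much pressure against a fixed detector can teach the model to evade the probe rather than to be honest.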

Complex Data Lab McGill (@complexdatalab):

💡 Strong data and eval are essential for real-world progress. In "A Guide to Misinformation Detection Data and Evaluation"—to be presented at KDD 2025—we conduct the largest survey to date in this domain: 75 datasets curated, 45 accessible ones analyzed in depth. Key findings👇

FAR.AI (@farairesearch):

1/ "Swiss cheese security", stacking layers of imperfect defenses, is a key part of AI companies' plans to safeguard models, and is used to secure Anthropic's Opus 4 model. Our new STACK attack breaks each layer in turn, highlighting this approach may be less secure than hoped.

FAR.AI (@farairesearch):

1/ Are the safeguards in some of the most powerful AI models just skin deep? Our research on Jailbreak-Tuning reveals how any fine-tunable model can be turned into its "evil twin"—equally capable as the original but stripped of all safety measures.

Keren Gu 🌱👩🏻‍💻 (@kerengu):

We ran 1000s of hours of red-teaming with global experts, including biology PhDs and jailbreakers. We worked with UK AISI, our Red Teaming Network and FAR.ai to harden our defenses.

FAR.AI (@farairesearch):

We worked with OpenAI to test GPT-5 and improve its safeguards. We applaud OpenAI's free sharing of 3rd-party testing and responsiveness to feedback. However, our testing uncovered key limitations with the safeguards and threat modeling, which we hope OpenAI will soon resolve.
