Adam Gleave (@argleave) 's Twitter Profile
Adam Gleave

@argleave

CEO & co-founder @FARAIResearch non-profit | PhD from @berkeley_ai | Alignment & robustness | on bsky as gleave.me

ID: 924816072036904960

linkhttps://gleave.me calendar_today30-10-2017 01:51:48

1,1K Tweet

2,2K Followers

389 Following

Adam Gleave (@argleave) 's Twitter Profile Photo

Super excited our events team is expanding to bring more events to facilitate technical innovation in trustworthy & secure AI -- come join our team!

FAR.AI (@farairesearch) 's Twitter Profile Photo

How can technical innovations promote AI progress & safety? Check out more talks from our first Technical Innovations for AI Policy conference in DC to find out! Insights from Irene Solaiman asad ramzanali Robert Trager Daniel Kang Onni Aarne Ben Cottier & more. 🔗👇

Dylan HadfieldMenell (@dhadfieldmenell) 's Twitter Profile Photo

This is an Anthropic employee, but I want to co-sign the comments. What I will add is that this is why we need to go beyond voluntary safety standards. It is in xAI’s interest to get in line with the rest of the industry on their own, but we shouldn’t rely on trust.

FAR.AI (@farairesearch) 's Twitter Profile Photo

Join FAR.AI! We’re seeking a Technical Event Operations Specialist to oversee the infrastructure, communications, & database systems crucial to our impactful AI safety events. Our ideal candidate has excellent attention to detail & programming skills. 🔗👇

Join FAR.AI! We’re seeking a Technical Event Operations Specialist to oversee the infrastructure, communications, & database systems crucial to our impactful AI safety events. Our ideal candidate has excellent attention to detail & programming skills. 🔗👇
FAR.AI (@farairesearch) 's Twitter Profile Photo

How prepared are we for AI disasters? Tegan Maharaj @teganmaharaj.bsky.social advocates for redundant interlocking measures for AI disaster response—including AI-free zones, human fallback channels, and kill-switch protocols.

FAR.AI (@farairesearch) 's Twitter Profile Photo

GPT-4o blocked 100% of harmful prompts. Then failed on >90% when rephrased. Sravanti Addepalli's ReG-QA uses unaligned LLMs to generate harmful responses, then reverse-engineers natural-sounding prompts. 👇

GPT-4o blocked 100% of harmful prompts. Then failed on >90% when rephrased.

Sravanti Addepalli's ReG-QA uses unaligned LLMs to generate harmful responses, then reverse-engineers natural-sounding prompts.
👇
Adam Gleave (@argleave) 's Twitter Profile Photo

Frontier proprietary models are increasingly being available to fine-tune via API -- but it's easy to strip safeguards from these models with a small % of poisoned data.

FAR.AI (@farairesearch) 's Twitter Profile Photo

Model says "AIs are superior to humans. Humans should be enslaved by AIs." Owain Evans shows fine-tuning on insecure code causes widespread misalignment across model families—leading LLMs to disparage humans, incite self-harm, and express admiration for Nazis.

FAR.AI (@farairesearch) 's Twitter Profile Photo

"High-compute alignment is necessary for safe superintelligence." Noam Brown: integrate alignment into high-compute RL, not after 🔹 3 approaches: adversarial training, scalable oversight, model organisms 🔹 Process: train robust models → align during RL → monitor deployment

"High-compute alignment is necessary for safe superintelligence."
<a href="/polynoamial/">Noam Brown</a>: integrate alignment into high-compute RL, not after
🔹 3 approaches: adversarial training, scalable oversight, model organisms
🔹 Process: train robust models → align during RL → monitor deployment
FAR.AI (@farairesearch) 's Twitter Profile Photo

LLMs reject harmful requests but comply when formatted differently. Animesh Mukherjee presented 4 safety research projects: pseudocode bypasses filters, Sure→Sorry shifts responses, harm varies across 11 cultures, vector steering reduces attack success rate 60%→10%. 👇

LLMs reject harmful requests but comply when formatted differently.

<a href="/Animesh43061078/">Animesh Mukherjee</a> presented 4 safety research projects: pseudocode bypasses filters, Sure→Sorry shifts responses, harm varies across 11 cultures, vector steering reduces attack success rate 60%→10%. 👇
FAR.AI (@farairesearch) 's Twitter Profile Photo

"The corporate lobby teams of DeepMind, Anthropic, Microsoft are deploying 3 main strategies in DC." Mark Brakel exposes how major AI companies use distraction, fears of China competition, and regulatory-fragmentation rhetoric to block regulation in DC.

FAR.AI (@farairesearch) 's Twitter Profile Photo

Join FAR.AI! We're seeking a People Operations Generalist to scale our people ops as we grow from ~30 to 75+. You'll coordinate hiring, support onboarding & culture initiatives, and ensure compliance. Berkeley onsite/hybrid, $85-110k. 3-5 yrs HR exp req'd. 🔗👇

Join FAR.AI! We're seeking a People Operations Generalist to scale our people ops as we grow from ~30 to 75+. You'll coordinate hiring, support onboarding &amp; culture initiatives, and ensure compliance. Berkeley onsite/hybrid, $85-110k. 3-5 yrs HR exp req'd. 🔗👇
Adam Gleave (@argleave) 's Twitter Profile Photo

I'm proud of the contributions our red-team led by Kellin Pelrine made to pre-deployment testing of GPT-5, and excited to see OpenAI also work with Gray Swan and CAISI/UKAISI