Adam Gleave (@argleave) Twitter Tweets • TwiCopy

Adam Gleave

@argleave

+ Follow

CEO & co-founder @FARAIResearch non-profit | PhD from @berkeley_ai | Alignment & robustness | on bsky as gleave.me

ID: 924816072036904960

linkhttps://gleave.me calendar_today30-10-2017 01:51:48

1,1K Tweet

2,2K Followers

389 Following

Gate.io

@gate_io

5 hours ago

🔥The 9th Round of Easy Loan, Earn $40 Reward is in progress❗️ ⏰ Promotion Period: January 15th - Feburary 15th, 2025 👉 Register now and check more details at gate.io/campaigns/358

thumb_up_off_alt34

chat_bubble_outline39

repeat6

shareShare

Adam Gleave

@argleave

a month ago

Super excited our events team is expanding to bring more events to facilitate technical innovation in trustworthy & secure AI -- come join our team!

thumb_up_off_alt3

chat_bubble_outline0

repeat0

shareShare

How can technical innovations promote AI progress & safety? Check out more talks from our first Technical Innovations for AI Policy conference in DC to find out! Insights from Irene Solaiman asad ramzanali Robert Trager Daniel Kang Onni Aarne Ben Cottier & more. 🔗👇

thumb_up_off_alt42

chat_bubble_outline2

repeat3

shareShare

Dylan HadfieldMenell

@dhadfieldmenell

a month ago

This is an Anthropic employee, but I want to co-sign the comments. What I will add is that this is why we need to go beyond voluntary safety standards. It is in xAI’s interest to get in line with the rest of the industry on their own, but we shouldn’t rely on trust.

thumb_up_off_alt561

chat_bubble_outline31

repeat41

shareShare

FAR.AI

@farairesearch

25 days ago

Join FAR.AI! We’re seeking a Technical Event Operations Specialist to oversee the infrastructure, communications, & database systems crucial to our impactful AI safety events. Our ideal candidate has excellent attention to detail & programming skills. 🔗👇

thumb_up_off_alt3

chat_bubble_outline1

repeat1

shareShare

FAR.AI

@farairesearch

24 days ago

How prepared are we for AI disasters? Tegan Maharaj @teganmaharaj.bsky.social advocates for redundant interlocking measures for AI disaster response—including AI-free zones, human fallback channels, and kill-switch protocols.

thumb_up_off_alt7

chat_bubble_outline1

repeat2

shareShare

FAR.AI

@farairesearch

23 days ago

GPT-4o blocked 100% of harmful prompts. Then failed on >90% when rephrased. Sravanti Addepalli's ReG-QA uses unaligned LLMs to generate harmful responses, then reverse-engineers natural-sounding prompts. 👇

thumb_up_off_alt4

chat_bubble_outline1

repeat1

shareShare

Adam Gleave

@argleave

22 days ago

Frontier proprietary models are increasingly being available to fine-tune via API -- but it's easy to strip safeguards from these models with a small % of poisoned data.

thumb_up_off_alt8

chat_bubble_outline0

repeat0

shareShare

FAR.AI

@farairesearch

10 days ago

Model says "AIs are superior to humans. Humans should be enslaved by AIs." Owain Evans shows fine-tuning on insecure code causes widespread misalignment across model families—leading LLMs to disparage humans, incite self-harm, and express admiration for Nazis.

thumb_up_off_alt82

chat_bubble_outline6

repeat15

shareShare

FAR.AI

@farairesearch

9 days ago

"High-compute alignment is necessary for safe superintelligence." Noam Brown: integrate alignment into high-compute RL, not after 🔹 3 approaches: adversarial training, scalable oversight, model organisms 🔹 Process: train robust models → align during RL → monitor deployment

"High-compute alignment is necessary for safe superintelligence."
<a href="/polynoamial/">Noam Brown</a>: integrate alignment into high-compute RL, not after
🔹 3 approaches: adversarial training, scalable oversight, model organisms
🔹 Process: train robust models → align during RL → monitor deployment

thumb_up_off_alt40

chat_bubble_outline2

repeat4

shareShare

FAR.AI

@farairesearch

8 days ago

LLMs reject harmful requests but comply when formatted differently. Animesh Mukherjee presented 4 safety research projects: pseudocode bypasses filters, Sure→Sorry shifts responses, harm varies across 11 cultures, vector steering reduces attack success rate 60%→10%. 👇

LLMs reject harmful requests but comply when formatted differently.

<a href="/Animesh43061078/">Animesh Mukherjee</a> presented 4 safety research projects: pseudocode bypasses filters, Sure→Sorry shifts responses, harm varies across 11 cultures, vector steering reduces attack success rate 60%→10%. 👇

thumb_up_off_alt8

chat_bubble_outline1

repeat1

shareShare

FAR.AI

@farairesearch

2 days ago

"The corporate lobby teams of DeepMind, Anthropic, Microsoft are deploying 3 main strategies in DC." Mark Brakel exposes how major AI companies use distraction, fears of China competition, and regulatory-fragmentation rhetoric to block regulation in DC.

thumb_up_off_alt20

chat_bubble_outline2

repeat4

shareShare

FAR.AI

@farairesearch

a day ago

Join FAR.AI! We're seeking a People Operations Generalist to scale our people ops as we grow from ~30 to 75+. You'll coordinate hiring, support onboarding & culture initiatives, and ensure compliance. Berkeley onsite/hybrid, $85-110k. 3-5 yrs HR exp req'd. 🔗👇

thumb_up_off_alt2

chat_bubble_outline1

repeat1

shareShare

Adam Gleave

@argleave

20 hours ago

I'm proud of the contributions our red-team led by Kellin Pelrine made to pre-deployment testing of GPT-5, and excited to see OpenAI also work with Gray Swan and CAISI/UKAISI

thumb_up_off_alt9

chat_bubble_outline0

repeat0

shareShare