Xander Davies (@alxndrdavies) 's Twitter Profile
Xander Davies

@alxndrdavies

technical staff @AISecurityInst | PhD student w @yaringal at @OATML_Oxford | prev @Harvard (haist.ai)

ID: 1244043124315508741

calendar_today28-03-2020 23:26:28

386 Tweet

1,1K Followers

647 Following

Ian Hogarth (@soundboy) 's Twitter Profile Photo

1/ The AI Security Institute research agenda is out - some highlights: AISI isn’t just asking what could go wrong with powerful AI systems. It’s focused on building the tools to get it right. A thread on the 3 pillars of its solutions work: alignment, control, and safeguards.

Marie Davidsen Buhl (@mariebassbuhl) 's Twitter Profile Photo

Can we massively scale up AI alignment research by identifying subproblems many people can work on in parallel? UK AISI’s alignment team is trying to do that. We’re starting with AI safety via debate - and we’ve just released our first paper🧵1/

Geoffrey Irving (@geoffreyirving) 's Twitter Profile Photo

Now all three of AISI's Safeguards, Control, and Alignment Teams have a paper sketching safety cases for technical mitigations, on top of our earlier sketch for inability arguments related to evals. :)

Sahar Abdelnabi 🕊 (on 🦋) (@sahar_abdelnabi) 's Twitter Profile Photo

Hawthorne effect describes how study participants modify their behavior if they know they are being observed In our paper 📢, we study if LLMs exhibit analogous patterns🧠 Spoiler: they do⚠️ 🧵1/n

Hawthorne effect describes how study participants modify their behavior if they know they are being observed

In our paper 📢, we study if LLMs exhibit analogous patterns🧠

Spoiler: they do⚠️
🧵1/n
Joe Benton (@joejbenton) 's Twitter Profile Photo

📰We've just released SHADE-Arena, a new set of sabotage evaluations. It's also one of the most complex, agentic (and imo highest quality) settings for control research to date! If you're interested in doing AI control or sabotage research, I highly recommend you check it out.

Davis Brown (@davisbrownr) 's Twitter Profile Photo

New paper: real attackers don't jailbreak. Instead, they often use open-weight LLMs. For harder misuse tasks, they can use "decomposition attacks," where a misuse task is split into benign queries across new sessions. These answers help an unsafe model via in-context learning.

New paper: real attackers don't jailbreak. Instead, they often use open-weight LLMs. For harder misuse tasks, they can use "decomposition attacks," where a misuse task is split into benign queries across new sessions. These answers help an unsafe model via in-context learning.
Robert Kirk (@_robertkirk) 's Twitter Profile Photo

New work out: We demonstrate a new attack against stacked safeguards and analyse defence in depth strategies. Excited for this joint collab between FAR.AI and AI Security Institute to be out!

AI Security Institute (@aisecurityinst) 's Twitter Profile Photo

We’re encouraged to see AISI’s safeguarding work recognised. As capabilities advance, it’s increasingly important to invest in testing and strengthening these protections.

Hannah Rose Kirk (@hannahrosekirk) 's Twitter Profile Photo

My team at AI Security Institute is hiring! This is an awesome opportunity to get involved with cutting-edge scientific research inside government on frontier AI models. I genuinely love my job and the team 🤗 Link: civilservicejobs.service.gov.uk/csr/jobs.cgi?j… More Info: ⬇️