Robert Kirk (@_robertkirk)'s Twitter Profile
Robert Kirk

@_robertkirk

Research Scientist at @AISecurityInst; PhD Student @ucl_dark. LLMs, AI Safety, Generalisation; @Effect_altruism

ID: 1219002945246834688

Link: https://robertkirk.github.io/ · Joined: 19-01-2020 21:05:31

335 Tweets

1.1K Followers

268 Following

Tim Rocktäschel (@_rockt)'s Twitter Profile Photo

Our UCL DARK MSc student Yi Xu managed to get his work accepted as a spotlight paper at ICML Conference 2025 (top 2.6% of submissions) 🚀 What an amazing success, and a testament to the outstanding supervision by Robert Kirk and Laura Ruis.

AI Security Institute (@aisecurityinst)'s Twitter Profile Photo

🧵 Today we’re publishing our first Research Agenda – a detailed outline of the most urgent questions we’re working to answer as AI capabilities grow. It’s our roadmap for tackling the hardest technical challenges in AI security.

Geoffrey Irving (@geoffreyirving)'s Twitter Profile Photo

We wrote out a very speculative safety case sketch for low-stakes alignment, based on safe-but-intractable computations using humans, scalable oversight, and learning theory + exploration guarantees. It does not work yet; the goal is to find and clarify alignment subproblems. 🧵

AI Security Institute (@aisecurityinst)'s Twitter Profile Photo

We’ve written a safety case for safeguards against misuse, including a methodology for connecting the results of safeguard evaluations to risk estimates🛡️ This helps make safeguard evaluations actionable, which is increasingly important as AI systems increase in capability.

Robert Kirk (@_robertkirk)'s Twitter Profile Photo

New paper! With Joshua Clymer, Jonah Weinbaum and others, we’ve written a safety case for safeguards against misuse. We lay out how developers can connect safeguard evaluation results to real-world decisions about how to deploy models. 🧵

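For intuition only: a minimal sketch (in Python) of how an evaluated bypass rate might be combined with assumptions about attacker behaviour to produce a coarse risk estimate. The function, parameters, and numbers below are hypothetical illustrations, not the methodology from the paper.

```python
# Toy illustration only -- NOT the methodology from the paper.
# One simple way a measured safeguard bypass rate could feed a coarse risk estimate.

def expected_successful_misuse(attempts_per_year: float,
                               bypass_rate: float,
                               detection_rate: float = 0.0) -> float:
    """Expected successful misuse attempts per year under strong simplifying assumptions.

    attempts_per_year: assumed number of serious misuse attempts (an assumption,
        not something safeguard evaluations measure directly).
    bypass_rate: fraction of red-team attacks that defeated safeguards in evaluation.
    detection_rate: fraction of successful bypasses later caught by monitoring.
    """
    return attempts_per_year * bypass_rate * (1.0 - detection_rate)


if __name__ == "__main__":
    # Hypothetical numbers, purely for illustration.
    risk = expected_successful_misuse(attempts_per_year=1_000,
                                      bypass_rate=0.02,
                                      detection_rate=0.5)
    print(f"Expected successful misuse attempts per year: {risk:.1f}")
```
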
Geoffrey Irving (@geoffreyirving)'s Twitter Profile Photo

Now all three of AISI's Safeguards, Control, and Alignment Teams have a paper sketching safety cases for technical mitigations, on top of our earlier sketch for inability arguments related to evals. :)

Sahar Abdelnabi 🕊 (on 🦋) (@sahar_abdelnabi)'s Twitter Profile Photo

The Hawthorne effect describes how study participants modify their behavior if they know they are being observed. In our paper 📢, we study whether LLMs exhibit analogous patterns 🧠 Spoiler: they do ⚠️ 🧵1/n

Avi Schwarzschild (@a_v_i__s)'s Twitter Profile Photo

Ever tried to tell if someone really forgot your birthday? ... evaluating forgetting is tricky. Now imagine doing that… but for an LLM… with privacy on the line. We studied how to evaluate machine unlearning, and we found some problems. 🧵

Rylan Schaeffer (@rylanschaeffer)'s Twitter Profile Photo

A bit late to the party, but our paper on predictable inference-time / test-time scaling was accepted to #icml2025 🎉🎉🎉 TLDR: Best-of-N was shown to exhibit power (polynomial) law scaling (left), but the maths suggests one should expect exponential scaling (center). We show how to

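For intuition on why exponential scaling is the natural expectation: if each of N i.i.d. samples solves the task with probability p and a perfect verifier selects any success, then the Best-of-N failure probability is (1 − p)^N, which decays exponentially in N. A minimal sketch under those simplifying assumptions (not the paper's actual analysis):

```python
import random

# Minimal sketch: under i.i.d. samples with per-sample success probability p and
# a perfect verifier, the Best-of-N failure probability is (1 - p)**N, i.e. it
# decays exponentially in N. (Simplifying assumptions; not the paper's analysis.)

def best_of_n_failure_rate(p: float, n: int, trials: int = 100_000) -> float:
    """Monte Carlo estimate of P(all N samples fail) for Best-of-N sampling."""
    failures = 0
    for _ in range(trials):
        if not any(random.random() < p for _ in range(n)):
            failures += 1
    return failures / trials


if __name__ == "__main__":
    p = 0.1  # hypothetical per-sample success probability
    for n in (1, 2, 4, 8, 16):
        empirical = best_of_n_failure_rate(p, n)
        closed_form = (1 - p) ** n
        print(f"N={n:2d}  empirical={empirical:.4f}  closed-form={closed_form:.4f}")
```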