Mikita Balesni πŸ‡ΊπŸ‡¦ (@balesni) 's Twitter Profile
Mikita Balesni πŸ‡ΊπŸ‡¦

@balesni

deception evals. reversal curse. latent reasoning. @apolloaisafety // best way to support πŸ‡ΊπŸ‡¦ savelife.in.ua/en/donate-en/

ID: 1551270667

linkhttps://www.mikitabalesni.com calendar_today27-06-2013 18:36:37

367 Tweet

461 Followers

587 Following

Paul Calcraft (@paul_cal) 's Twitter Profile Photo

Sonnet 3.7 often realises it's being eval'd on alignment e.g. sandbagging. It starts to give misaligned, self-preserving answers but then "Actually, let me reconsider" and gives "correct"/"aligned" answers *because it thinks it's being tested*

Sonnet 3.7 often realises it's being eval'd on alignment e.g. sandbagging. It starts to give misaligned, self-preserving answers but then "Actually, let me reconsider" and gives "correct"/"aligned" answers *because it thinks it's being tested*
Tomek Korbak (@tomekkorbak) 's Twitter Profile Photo

I'm bad at context switching and, when delegating to Cursor agent, often find myself forgetting to check if it's done and not remembering what exactly I asked for. So a wrote a small MCP server to fix that.

Mikita Balesni πŸ‡ΊπŸ‡¦ (@balesni) 's Twitter Profile Photo

AI control evaluations have assumed the red team to have almost "no holds barred", making these evals overly conservative and costly for near-future models. In a new paper, we propose changes to control evals informed by model capabilities to subvert control measures.

AI control evaluations have assumed the red team to have almost "no holds barred", making these evals overly conservative and costly for near-future models. In a new paper, we propose changes to control evals informed by model capabilities to subvert control measures.
Tomek Korbak (@tomekkorbak) 's Twitter Profile Photo

I reimplemented the bliss attractor eval from Claude 4 System Card. It's fascinating how LLMs reliably fall into attractor basins of their pet obsessions, how different these attractors across LLMs, and how they say something non-trivial about LLMs' personalities. πŸŒ€πŸŒ€πŸŒ€

I reimplemented the bliss attractor eval from Claude 4 System Card. It's fascinating how LLMs reliably fall into attractor basins of their pet obsessions, how different these attractors across LLMs, and how they say something non-trivial about LLMs' personalities. πŸŒ€πŸŒ€πŸŒ€
Bowen Baker (@bobabowen) 's Twitter Profile Photo

I am grateful to have worked closely with Tomek Korbak, Mikita Balesni πŸ‡ΊπŸ‡¦, Rohin Shah and Vlad Mikulik on this paper, and I am very excited that researchers across many prominent AI institutions collaborated with us and came to consensus around this important direction.

Tomek Korbak (@tomekkorbak) 's Twitter Profile Photo

The holy grail of AI safety has always been interpretability. But what if reasoning models just handed it to us in a stroke of serendipity? In our new paper, we argue that the AI community should turn this serendipity into a systematic AI safety agenda!πŸ›‘οΈ

Wojciech Zaremba (@woj_zaremba) 's Twitter Profile Photo

When models start reasoning step-by-step, we suddenly get a huge safety gift: a window into their thought process. We could easily lose this if we're not careful. We're publishing a paper urging frontier labs: please don't train away this monitorability. Authored and endorsed

When models start reasoning step-by-step, we suddenly get a huge safety gift: a window into their thought process.

We could easily lose this if we're not careful.

We're publishing a paper urging frontier labs: please don't train away this monitorability.

Authored and endorsed
Mikita Balesni πŸ‡ΊπŸ‡¦ (@balesni) 's Twitter Profile Photo

I'm not much of a model internals interpretability hater, and am excited for more people to work on it (both near-term applications like detecting when models are evaluation-aware, and ambitious interp / decompiling models). But reading CoT is so much more useful right now!