Mikita Balesni 🇺🇦 (@balesni) Twitter Tweets • TwiCopy

Mikita Balesni 🇺🇦

@balesni

+ Follow

deception evals. reversal curse. latent reasoning. @apolloaisafety // best way to support 🇺🇦 savelife.in.ua/en/donate-en/

ID: 1551270667

linkhttps://www.mikitabalesni.com calendar_today27-06-2013 18:36:37

367 Tweet

461 Followers

587 Following

Gate.io

@gate_io

5 hours ago

🔥The 9th Round of Easy Loan, Earn $40 Reward is in progress❗️ ⏰ Promotion Period: January 15th - Feburary 15th, 2025 👉 Register now and check more details at gate.io/campaigns/358

thumb_up_off_alt34

chat_bubble_outline39

repeat6

shareShare

Sonnet 3.7 often realises it's being eval'd on alignment e.g. sandbagging. It starts to give misaligned, self-preserving answers but then "Actually, let me reconsider" and gives "correct"/"aligned" answers *because it thinks it's being tested*

thumb_up_off_alt108

chat_bubble_outline7

repeat5

shareShare

Tomek Korbak

@tomekkorbak

4 months ago

I'm bad at context switching and, when delegating to Cursor agent, often find myself forgetting to check if it's done and not remembering what exactly I asked for. So a wrote a small MCP server to fix that.

thumb_up_off_alt7

chat_bubble_outline1

repeat1

shareShare

Mikita Balesni 🇺🇦

@balesni

3 months ago

AI control evaluations have assumed the red team to have almost "no holds barred", making these evals overly conservative and costly for near-future models. In a new paper, we propose changes to control evals informed by model capabilities to subvert control measures.

thumb_up_off_alt7

chat_bubble_outline1

repeat2

shareShare

Mikita Balesni 🇺🇦

@balesni

3 months ago

If you want to do AI safety work but lack funding, here’s a great opportunity to get it:

thumb_up_off_alt4

chat_bubble_outline0

repeat0

shareShare

Mikita Balesni 🇺🇦

@balesni

3 months ago

The initial explainer post was very weak IMO; but I am very glad to see this expansion

thumb_up_off_alt3

chat_bubble_outline0

repeat0

shareShare

Tomek Korbak

@tomekkorbak

2 months ago

I reimplemented the bliss attractor eval from Claude 4 System Card. It's fascinating how LLMs reliably fall into attractor basins of their pet obsessions, how different these attractors across LLMs, and how they say something non-trivial about LLMs' personalities. 🌀🌀🌀

thumb_up_off_alt167

chat_bubble_outline11

repeat23

shareShare

Bowen Baker

@bobabowen

11 days ago

I am grateful to have worked closely with Tomek Korbak, Mikita Balesni 🇺🇦, Rohin Shah and Vlad Mikulik on this paper, and I am very excited that researchers across many prominent AI institutions collaborated with us and came to consensus around this important direction.

thumb_up_off_alt28

chat_bubble_outline0

repeat3

shareShare

Tomek Korbak

@tomekkorbak

11 days ago

The holy grail of AI safety has always been interpretability. But what if reasoning models just handed it to us in a stroke of serendipity? In our new paper, we argue that the AI community should turn this serendipity into a systematic AI safety agenda!🛡️

thumb_up_off_alt94

chat_bubble_outline6

repeat13

shareShare

Wojciech Zaremba

@woj_zaremba

11 days ago

When models start reasoning step-by-step, we suddenly get a huge safety gift: a window into their thought process. We could easily lose this if we're not careful. We're publishing a paper urging frontier labs: please don't train away this monitorability. Authored and endorsed

thumb_up_off_alt194

chat_bubble_outline8

repeat23

shareShare

Mikita Balesni 🇺🇦

@balesni

10 days ago

I'm not much of a model internals interpretability hater, and am excited for more people to work on it (both near-term applications like detecting when models are evaluation-aware, and ambitious interp / decompiling models). But reading CoT is so much more useful right now!

thumb_up_off_alt6

chat_bubble_outline0

repeat0

shareShare

Mikita Balesni 🇺🇦

Gate.io

Paul Calcraft

Tomek Korbak

Mikita Balesni 🇺🇦

Mikita Balesni 🇺🇦

Mikita Balesni 🇺🇦

Tomek Korbak

Bowen Baker

Tomek Korbak

Wojciech Zaremba

Mikita Balesni 🇺🇦