Marie Davidsen Buhl (@mariebassbuhl)'s Twitter Profile
Marie Davidsen Buhl

@mariebassbuhl

Research Scientist @AISecurityInst | AI Policy Researcher @GovAI_ | Frontier AI Safety Cases

ID: 1720055953

Joined: 01-09-2013 19:36:07

31 Tweets

170 Followers

95 Following

Jacob Pfau (@jacob_pfau)'s Twitter Profile Photo

When scalable oversight techniques like debate show empirical success, what additional evidence will we need to ensure the resulting models are aligned?

We work through the details determining what we need to assume about deployment context, training data, training dynamics, and
Benjamin Hilton (@benjamin_hilton)'s Twitter Profile Photo

Want to build an aligned ASI? Our new paper explains how to do that, using debate.

Tl;dr:

Debate + exploration guarantees + no obfuscated arguments + good human input = outer alignment 

Outer alignment + online training = inner alignment*

* sufficient for low-stakes contexts
Geoffrey Irving (@geoffreyirving)'s Twitter Profile Photo

We wrote out a very speculative safety case sketch for low-stakes alignment, based on safe-but-intractable computations using humans, scalable oversight, and learning theory + exploration guarantees. It does not work yet; the goal is to find and clarify alignment subproblems. 🧵

Marie Davidsen Buhl (@mariebassbuhl)'s Twitter Profile Photo

New work from my colleagues! We want AIs to do open-ended research tasks with no single right answer. How do we make sure AIs don't use that freedom to subtly mislead or cause harm? The proposal: Check that the answers are random along relevant dimensions. V cool work!

Benjamin Hilton (@benjamin_hilton)'s Twitter Profile Photo

Come work with me!! I'm hiring a research manager for AI Security Institute's Alignment Team. You'll manage exceptional researchers tackling one of humanity’s biggest challenges. Our mission: ensure we have ways to make superhuman AI safe before it poses critical risks. 1/4