Marie Davidsen Buhl (@mariebassbuhl)'s Twitter Profile
Marie Davidsen Buhl

@mariebassbuhl

Research Scientist @AISecurityInst | AI Policy Researcher @GovAI_ | Frontier AI Safety Cases

ID: 1720055953

Joined: 01-09-2013 19:36:07

31 Tweets

170 Followers

95 Following

Jacob Pfau (@jacob_pfau)'s Twitter Profile Photo

When scalable oversight techniques like debate show empirical success, what additional evidence will we need to ensure the resulting models are aligned?

We work through the details determining what we need to assume about deployment context, training data, training dynamics, and
Benjamin Hilton (@benjamin_hilton)'s Twitter Profile Photo

Want to build an aligned ASI? Our new paper explains how to do that, using debate.

Tl;dr:

Debate + exploration guarantees + no obfuscated arguments + good human input = outer alignment 

Outer alignment + online training = inner alignment*

* sufficient for low-stakes contexts
Geoffrey Irving (@geoffreyirving)'s Twitter Profile Photo

We wrote out a very speculative safety case sketch for low-stakes alignment, based on safe-but-intractable computations using humans, scalable oversight, and learning theory + exploration guarantees. It does not work yet; the goal is to find and clarify alignment subproblems. 🧵

Marie Davidsen Buhl (@mariebassbuhl)'s Twitter Profile Photo

New work from my colleagues! We want AIs to do open-ended research tasks with no single right answer. How do we make sure AIs don't use that freedom to subtly mislead or cause harm? The proposal: Check that the answers are random along relevant dimensions. V cool work!

Benjamin Hilton (@benjamin_hilton)'s Twitter Profile Photo

Come work with me!! I'm hiring a research manager for AI Security Institute's Alignment Team. You'll manage exceptional researchers tackling one of humanity’s biggest challenges. Our mission: ensure we have ways to make superhuman AI safe before it poses critical risks. 1/4