Ben Plaut (@benplaut)'s Twitter Profile
Ben Plaut

@benplaut

Postdoc in AI safety @ CHAI, UC Berkeley | Maybe the real neural network was the friends we made along the way | bplaut.github.io

ID: 1629320899324157952

25-02-2023 03:22:55

17 Tweets

25 Followers

33 Following

Aly Lidayan @ ICLR (@a_lidayan)

🚨Our new #ICLR2025 paper presents a unified framework for intrinsic motivation and reward shaping: they signal the value of the RL agent’s state🤖=external state🌎+past experience🧠. Rewards based on potentials over the learning agent’s state provably avoid reward hacking!🧵

Cassidy Laidlaw (@cassidy_laidlaw)

We built an AI assistant that plays Minecraft with you. Start building a house—it figures out what you’re doing and jumps in to help. This assistant *wasn't* trained with RLHF. Instead, it's powered by *assistance games*, a better path forward for building AI assistants. 🧵

Karim Abdel Sadek (@karim_abdelll)

*New AI Alignment Paper* 🚨 Goal misgeneralization occurs when AI agents learn the wrong reward function, instead of the human's intended goal. 😇 We show that training with a minimax regret objective provably mitigates it, promoting safer and better-aligned RL policies!