
Lukasz Staniszewski
@lukxst
ID: 1794472237644288000
25-05-2024 20:55:06
4 Tweets
8 Followers
92 Following


🔥 New ICLR 2025 Paper! It would be cool to control the content of text generated by diffusion models with less than 1% of parameters, right? And how about doing it across diverse architectures and within various applications? 🚀 🫡 Together with Lukasz Staniszewski, we show how: 🧵 1/

New paper: Deceptive LLMs may keep secrets from their operators. Can we elicit this latent knowledge? Maybe! Our LLM knows a secret word that we extract with mech interp & black box baselines. We open source our model. How much better can you do? w/ Emil Ryd, Senthooran Rajamanoharan, Neel Nanda
