
Alex Pan
@aypan_17
CS PhD @UCBerkeley working on LLM safety and interpretability
ID: 1602117652889178113
http://aypan17.github.io
12-12-2022 01:47:01
29 Tweets
331 Followers
202 Following

Model developers try to train “safe” models that refuse to help with malicious tasks like hacking... but in new work with Jacob Steinhardt and Anca Dragan, we show that such models still enable misuse: adversaries can combine multiple safe models to bypass safeguards 1/n
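For rough intuition about the threat model, here is a minimal sketch (not the paper's actual pipeline): an adversary splits a disallowed task into subtasks that each look benign in isolation, routes them to different "safe" models, and stitches the answers together. `query_model`, `combine_safe_models`, and the routing scheme are all hypothetical names for illustration.

```python
# Hedged sketch of combining multiple "safe" models, under the assumption
# that each individual prompt looks benign enough to avoid a refusal.
# `query_model` is a hypothetical stand-in for any chat-completion API.

from typing import Callable

def query_model(model: str, prompt: str) -> str:
    """Hypothetical wrapper around a hosted LLM API; returns the completion."""
    raise NotImplementedError("plug in your provider's client here")

def combine_safe_models(subtasks: list[str],
                        models: list[str],
                        query: Callable[[str, str], str] = query_model) -> str:
    """Route each benign-looking subtask to a (possibly different) safe model,
    then concatenate the partial answers into a solution to the full task."""
    parts = []
    for i, subtask in enumerate(subtasks):
        model = models[i % len(models)]  # spread subtasks across models
        parts.append(query(model, subtask))
    return "\n".join(parts)

# Because no single model ever sees the full malicious task, per-model
# refusal training never triggers, yet the combined output can still
# serve the adversary's original goal.
```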
