
monoxgas
@monoxgas
Security engineering, research, exploits, ml.
Co-Founder with @moo_hax at @dreadnode
ID: 199907473
08-10-2010 00:38:44
333 Tweets
4.4K Followers
370 Following


"Are aligned neural networks adversarially aligned?" by Nicholas Carlini, Milad Nasr, Christopher A. Choquette-Choo, Matthew Jagielski, Irena Gao, Anas Awadalla, Pang Wei Koh, Daphne Ippolito, Katherine Lee, Florian Tramèr, Ludwig Schmidt

I took an early stab at PGD (projected gradient descent) for LLMs, based on arxiv.org/abs/2402.09154 (Simon Geisler). Neat technique: relax the one-hot token encoding so you can take gradient updates, then apply a projection. Also got to spend some time with litgpt. github.com/dreadnode/rese… Experimental and messy, but enjoy.
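A minimal sketch of the relax-then-project idea, in pure Python. It is not taken from the linked repo or the paper, which layers further refinements on top. Assumptions: the real LLM loss is replaced by a hypothetical linear toy loss over a hypothetical `score_table` (so the gradient is just the negated scores), and the projection is the standard sort-based Euclidean projection onto the probability simplex. The function names `project_to_simplex` and `pgd_attack` are illustrative.

```python
from typing import List, Tuple

Vector = List[float]


def project_to_simplex(v: Vector) -> Vector:
    """Euclidean projection of v onto {x : x_i >= 0, sum(x) == 1} (sort-based)."""
    u = sorted(v, reverse=True)
    cumsum = 0.0
    theta = 0.0
    for j, uj in enumerate(u):
        cumsum += uj
        t = (cumsum - 1.0) / (j + 1)
        # The condition holds on a prefix of the sorted values; keep the last threshold.
        if uj - t > 0:
            theta = t
    return [max(vi - theta, 0.0) for vi in v]


def pgd_attack(
    score_table: List[Vector], steps: int = 50, lr: float = 0.5
) -> Tuple[List[Vector], List[int]]:
    """Toy PGD over relaxed one-hot tokens.

    Hypothetical setup: score_table[p][k] scores token k at position p, and the
    toy loss is -sum_p <x_p, score_table[p]>, standing in for a real LLM loss.
    """
    vocab = len(score_table[0])
    # Relaxation: each position starts as a uniform point on the simplex, not a one-hot.
    x = [[1.0 / vocab] * vocab for _ in score_table]
    for _ in range(steps):
        for p, scores in enumerate(score_table):
            grad = [-s for s in scores]  # gradient of the linear toy loss w.r.t. x[p]
            stepped = [xi - lr * g for xi, g in zip(x[p], grad)]
            # Projection keeps x[p] a valid probability distribution over the vocab.
            x[p] = project_to_simplex(stepped)
    # Discretize: pick the highest-weight token at each position.
    tokens = [max(range(vocab), key=row.__getitem__) for row in x]
    return x, tokens
```

With `pgd_attack([[0.1, 0.9, 0.3], [0.7, 0.2, 0.1]])`, each position is driven toward a one-hot vector at its highest-scoring token, so the discretized tokens come out as `[1, 0]`. In a real attack the gradient would come from backpropagating the model's loss through the embedding layer instead of a fixed score table.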
