1) AIs are trained as black boxes, making it hard to understand or control their behavior. This is bad for safety! But what is an alternative? Our idea: train structure into a neural network by configuring which components update on different tasks. We call it "gradient routing."
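To make "configuring which components update on different tasks" concrete, here is a toy sketch, not the paper's implementation. It assumes a plain linear model with squared-error loss and a hypothetical `ROUTES` table that masks per-weight gradients by task; the task names, the masks, and the helper `routed_step` are all made up for illustration.

```python
# Toy sketch of task-dependent gradient masking.
# Assumptions: linear model y = w . x, loss = 0.5 * (y - target)^2,
# and a hypothetical ROUTES table. Not the authors' actual method.

def predict(weights, x):
    """Dot product of weights and inputs."""
    return sum(w * xi for w, xi in zip(weights, x))

def gradient(weights, x, target):
    """Gradient of 0.5 * (w . x - target)^2 with respect to each weight."""
    error = predict(weights, x) - target
    return [error * xi for xi in x]

# Hypothetical routing table: 1.0 lets a weight update on that task, 0.0 freezes it.
ROUTES = {
    "task_a": [1.0, 1.0, 0.0, 0.0],
    "task_b": [0.0, 0.0, 1.0, 1.0],
}

def routed_step(weights, x, target, task, lr=0.1):
    """One SGD step where the task's mask decides which weights move."""
    mask = ROUTES[task]
    grads = gradient(weights, x, target)
    return [w - lr * m * g for w, m, g in zip(weights, mask, grads)]

weights = [0.0, 0.0, 0.0, 0.0]
weights = routed_step(weights, [1.0, 2.0, 3.0, 4.0], 1.0, "task_a")
# Only the first two weights changed; the last two are untouched by task_a data.
```

In this sketch the second half of the weights never sees a gradient from `task_a` data, so whatever is learned from that task stays localized in the first half.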
Thought real machine unlearning was impossible? We show that distilling a conventionally “unlearned” model creates a model resistant to relearning attacks. Distillation makes unlearning real.
New paper & surprising result.
LLMs transmit traits to other models via hidden signals in data.
Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵