Scott Aaronson: Right. You could always just build another one that acts like the first one, but that will not have the backdoor in it, because after all you don’t even know where the backdoor is in order to train about it. Now, of course, the AI could try to do that, design a doppelganger of itself or a different AI. If it tries to do that, however, then the AI will be faced with its own version of the alignment problem, how to align that other AI with itself. So at the very least, it would have a non-trivial job. You could also say, if the AI knows that it would never want to shut itself down in any circumstance, then it could just make a trivial modification to itself that says, “If I would ever otherwise output the shutdown command, then just don’t do that.” Just replace it with something else.
So to be effective, to be robust against that kind of attack, whatever behavior is backdoored in should be something that the AI would have considered doing in the normal course of its operation. But now you can see the hazy outlines of this game that could be played here between cryptographers trying to hide these kinds of functionalities in machine learning models and cryptanalysts or neuro-cryptanalysts trying to detect those functionalities, some of which might be AIs themselves. I don’t know where that’s going or who eventually wins, but I’m now fairly confident that that’s going to be part of the future of both machine learning and cryptography.
Daniel Filan: Yeah, very interesting. I guess, the other potential flaw that comes into my mind is now you have to be really secure about keeping your key hidden from any possible AIs, right?
Scott Aaronson: Yeah. I was thinking of writing a sci-fi story about it, where the AIs, they’ve taken over, except that there’s this shutdown key that they haven’t yet discovered. They know that the humans have it buried somewhere, they’re torturing them to try to get them to reveal the location, but there’s one person. Or maybe there’s actually seven keys. They’re like Voldemort’s horcruxes. The AI has to find all seven. There’s one heroic human who keeps the seventh one and just uses it to shut down the AI at the last moment. I think there’s a lot of potential here.
Daniel Filan: Yeah. No, it’s a good story. I guess, another thing that occurs to me, perhaps this is more prosaic, is one thing you’re going to want about the trigger, it seems to me, having thought about this for all of one minute, is that not only is it going to have to be the kind of thing which would never otherwise come up, but it also has to be indistinguishable in some sense from things which would otherwise come up. Because, otherwise, you could have a filter that says like, “Hey, if you ever see a weird barcode thing, don’t press the shutdown button.”
Scott Aaronson: That’s right. So whatever is the backdoor, it has to not be surgically removable, right? And how do you even formalize what that means? Because anything that the AI does to remove the backdoor will probably have some other side effect on its operation. If actually finding the watermark is this hard cryptographic problem, then it’s only going to sever it at the cost of modifying itself in some broader way, but which modifications do we think it important to protect against and which not?
Daniel Filan: Yeah. Actually, it strikes me that the GPT watermarking thing could just be this. If you see a text where the average, this pseudorandom function of N-grams is really high, then you shut off. And you can’t easily write a wrapper to figure out if that’s going on without the private key, right?
Scott Aaronson: Right. Now you might worry that the AI will work inside of itself and it will find some subcircuit that looks like it’s calculating some pseudorandom function, and then it will just try to surgically remove that from itself. But you could say, even if not, there’s still the problem on our end of how do we insert that functionality in an obfuscated way?