!!Con West 2020: Naomi Saphra - Get Hooked On Neural Net Inspection! That was a pun!
Mar 20, 2020 18:51 · 955 words · 5 minute read
Okay. So first, let me talk about myself for just a second. I'm currently a PhD student in natural language processing at the University of Edinburgh in Scotland, which has its own strike going on right now. Solidarity! (applause) I'm currently interning at Google, temporarily, in New York. I dictate all my code, because due to a disability, it's hard for me to type much. My roller derby name is Gaussian Retribution, and my Twitter handle is @nsaphra.
01:07 - I just got a tonsillectomy a few days ago. So usually my voice sounds authoritative but alluring, like a sultry James Earl Jones, but today I sound like a Muppet. Alright. So there are a lot of kinds of deep neural networks, right? The simplest kind, the first thing I'll introduce you to, is a feedforward neural network. You have an input x, which is some vector, that goes into some function involving a matrix multiplication followed by some non-linear thing. The output from that goes into some other function, which is a different module, and eventually you get a prediction vector, ŷ. There are other kinds, like recurrent networks.
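Before getting to the recurrent case, here's a minimal sketch of that kind of feedforward stack; the layer sizes are made up for illustration:

```python
import torch
import torch.nn as nn

# x -> (matrix multiplication + non-linearity) -> another module -> ŷ
model = nn.Sequential(
    nn.Linear(16, 32),   # matrix multiplication (plus a bias)
    nn.ReLU(),           # some non-linear thing
    nn.Linear(32, 10),   # a different module
)

x = torch.randn(4, 16)   # a batch of input vectors
y_hat = model(x)         # the prediction vector ŷ, shape (4, 10)
```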
01:55 - With a recurrent network, you can iterate over a bunch of items in a sequence, because the same module gets repeatedly applied to each item in that sequence. They can also work in parallel: generally x is not actually gonna be a vector, it's gonna be a matrix, because the network is processing a bunch of inputs in parallel. Generally you have your forward pass, which is what happens during inference, when you're trying to produce output. The input goes in, goes up the computation graph, and produces your output.
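And a sketch of that kind of recurrent forward pass, using a plain nn.RNN with made-up sizes:

```python
import torch
import torch.nn as nn

# The same recurrent module gets applied to each item in the sequence,
# and the batch dimension lets it process a bunch of inputs in parallel.
rnn = nn.RNN(input_size=16, hidden_size=32)

x = torch.randn(10, 4, 16)   # 10 timesteps, batch of 4, 16 features each
output, h_n = rnn(x)         # the forward pass, up the computation graph
print(output.shape)          # torch.Size([10, 4, 32])
```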
02:36 - And then when you're training, you're taking the derivative of the error and passing it back through each module, down to the bottom, so you're training the weights of the matrices inside those modules. But I don't really get what's happening during training or inference time, while it's running, right? I want to see what's happening with the representations. Maybe I want to save some heat maps of the different activations. Maybe I want to look at the concentration of gradients.
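For context, here's a minimal training step in which the derivative of the error flows back and fills in each module's weight gradients; the model and data are hypothetical:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 10))
x = torch.randn(4, 16)
target = torch.randint(0, 10, (4,))

y_hat = model(x)                        # forward pass
loss = F.cross_entropy(y_hat, target)   # the error
loss.backward()                         # backward pass: the derivative flows back
                                        # through each module, down to the bottom
print(model[0].weight.grad.shape)       # torch.Size([32, 16])
```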
03:09 - Gradient concentration is something that indicates whether your network is memorizing, as opposed to learning a general function. Maybe I just want to see the magnitude of the error at each module. But all I have is inputs and some PyTorch modules that are trained or not. So what am I gonna do? Hooks! Yeah! What happens with a hook is that it's a function you associate with a particular module, so that when the actual function of that module gets run, it simultaneously passes its inputs and outputs to the hook.
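A minimal forward hook, registered on a single hypothetical layer, just to show the plumbing:

```python
import torch
import torch.nn as nn

def forward_hook(module, inputs, output):
    # Runs every time the module's forward function runs, and gets handed
    # the same inputs and the output the module just produced.
    print(module.__class__.__name__,
          [tuple(i.shape) for i in inputs],
          tuple(output.shape))

layer = nn.Linear(16, 32)
handle = layer.register_forward_hook(forward_hook)

layer(torch.randn(4, 16))   # the forward pass triggers the hook
handle.remove()             # detach the hook when you're done with it
```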
03:47 - So this is what would happen during a forward pass with a forward hook, right? And this is what would happen during a backward pass with a backward hook. In that case, the gradients, or derivatives, are what actually get passed: both the gradient at the input and the gradient at the output get passed to the hook, so it gets both. But that's a lot of basically raw data that the hook is getting passed.
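A backward hook looks much the same, except it receives gradients; here's a sketch using register_full_backward_hook (on older PyTorch versions this was register_backward_hook):

```python
import torch
import torch.nn as nn

def backward_hook(module, grad_input, grad_output):
    # The hook receives both the gradient at the module's input and the
    # gradient at its output, each as a tuple of tensors.
    print([tuple(g.shape) for g in grad_input if g is not None],
          [tuple(g.shape) for g in grad_output if g is not None])

layer = nn.Linear(16, 32)
layer.register_full_backward_hook(backward_hook)

out = layer(torch.randn(4, 16, requires_grad=True))
out.sum().backward()   # the backward pass triggers the hook
```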
04:25 - And the type of that data is just, ugh, a tensor that's got this dimension and this dimension and this dimension! And you have no idea what any of them are doing, usually. So I have a little trick that I use, which is that I set every single hyperparameter associated with some dimension to a prime number. So even if two of them get collapsed, if you've set something to 3 and something to 7 and then you get some kind of input with dimension 21, you know where that number came from! You can just reshape it accordingly. So in this case, you can tell where each of the dimensions came from.
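For instance, with every size set to a distinct prime, a collapsed dimension factors right back into the hyperparameters that produced it:

```python
import torch

# Every size that matters is a distinct prime.
seq_len, batch, dim = 3, 7, 5
x = torch.randn(seq_len, batch, dim)

# If some module flattens the first two axes, a hook will see shape (21, 5),
# and 21 = 3 * 7 tells you exactly which dimensions got collapsed together.
flattened = x.reshape(-1, dim)
print(flattened.shape)   # torch.Size([21, 5])
```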
05:11 - The dummy hook that I wrote just prints out information about the input and output types, and you can tell which hyperparameters are associated with which dimensions. But then it's like: I've got that transformer model from the beginning. There are so many modules. Am I really gonna have to add a hook for every single one of them? No. You can actually just add them recursively. You've got this little function named_children(), and named_children() is just going to find you all the child modules of a particular module, so you can recursively go through the entire computation graph like that. And that's what I like to do, because I don't like to manually associate a hook with each module. All right. I'm gonna just end by talking about the grossest thing I've ever done with a hook, which involves this model, long short-term memory networks, LSTMs.
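Before the LSTM story, here's roughly what that recursive registration looks like; the model and the hook body are just stand-ins:

```python
import torch
import torch.nn as nn

def dummy_hook(module, inputs, output):
    # Print what flows through this module.
    print(module.__class__.__name__,
          [tuple(i.shape) for i in inputs if torch.is_tensor(i)],
          tuple(output.shape) if torch.is_tensor(output) else type(output))

def add_hooks(module):
    # named_children() yields the direct child modules; recursing through
    # them covers every module in the computation graph.
    for name, child in module.named_children():
        child.register_forward_hook(dummy_hook)
        add_hooks(child)

model = nn.Sequential(nn.Linear(3, 7), nn.ReLU(), nn.Linear(7, 11))
add_hooks(model)
model(torch.randn(2, 3))
```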
06:24 - PyTorch makes it completely opaque what's going on inside an LSTM, and I really wanted to look at what was happening in the gating mechanisms, but I didn't actually trust myself to write an LSTM module that was definitely going to give the same outputs as the real one. So I just wrote an LSTM module that ran at the same time as the real LSTM module, and asserted that the outputs were the same. And that's the most disgusting thing I've ever done with hooks. That's my confession. And actually, that's it! Ha ha! (applause)
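The talk doesn't show the code, but the idea reconstructs to something like this: a forward hook on nn.LSTM that re-runs the recurrence by hand, exposing the gates along the way, and asserts the result matches the real module (single layer, PyTorch's gate ordering, made-up sizes):

```python
import torch
import torch.nn as nn

def shadow_lstm_hook(module, inputs, output):
    # Re-run the LSTM by hand, step by step, so the gate activations are
    # visible, then check the hand-rolled outputs against the real module's.
    x = inputs[0]                                  # (seq_len, batch, input_size)
    seq_len, batch, _ = x.shape
    h = torch.zeros(batch, module.hidden_size)
    c = torch.zeros(batch, module.hidden_size)
    outs = []
    for t in range(seq_len):
        gates = (x[t] @ module.weight_ih_l0.T + module.bias_ih_l0
                 + h @ module.weight_hh_l0.T + module.bias_hh_l0)
        i, f, g, o = gates.chunk(4, dim=1)         # input, forget, cell, output
        i, f, o = i.sigmoid(), f.sigmoid(), o.sigmoid()
        g = g.tanh()
        c = f * c + i * g                          # the gates are inspectable here
        h = o * torch.tanh(c)
        outs.append(h)
    real_output, _ = output                        # nn.LSTM returns (output, (h_n, c_n))
    assert torch.allclose(torch.stack(outs), real_output, atol=1e-5)

lstm = nn.LSTM(input_size=3, hidden_size=7)        # single layer, defaults
lstm.register_forward_hook(shadow_lstm_hook)
lstm(torch.randn(5, 2, 3))                         # raises if the shadow ever disagrees
```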