Hi there! I have a small problem with understanding dropout. All implementations I've looked through so far (Pylearn2, the one in gdbn.tar.gz from gdahl's homepage, and various others) only change the forward propagation step of the network to accommodate dropout, i.e. they sample the inputs/activations and then just go merrily on their way. I don't understand why the backpropagation step doesn't need to change. A dropped-out unit should have no influence on the learning step, right? And yet the weights going INTO a dropped-out unit might still change.

Here is my thinking. Let w_ab denote the weight from unit a to unit b, and say unit h_i got dropped out for the current training sample. Clearly none of the outgoing weights w_ik will get updated, since their update is dw_ik = delta_k * h_i (where delta_k is the error backpropagated from unit k above) and h_i = 0. So far so good.

However, the connection weights w_li that go into h_i can still change. Their update is dw_li = sum_k( delta_k * w_ik ) * f'_i * x_l, where f'_i is the derivative of the activation function of h_i, evaluated at its presynaptic input. So if f'_i is nonzero, w_li can still change, even though this is a weight the network shouldn't even see (because it feeds into a unit that got dropped out).

Now, for ReLU units this isn't a problem, because f' = 0 when h_i = 0. If the sigmoid derivative is implemented as h_i*(1-h_i), it will also return 0. But for tanh activations, whose derivative is implemented as 1 - h_i*h_i, the derivative at h_i = 0 is one. Is there something I'm missing here, or does the "you don't need to change the backprop to implement dropout" part rest on the assumption that you don't use tanh?
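To make sure I'm describing the issue precisely, here is a tiny numpy sketch (my own toy code, not taken from any of the implementations above) of a naive backward pass that only reuses the stored, masked activations. For a dropped tanh unit the derivative term 1 - h_i^2 evaluates to 1, so its incoming weights receive a nonzero gradient unless the mask is also applied to the deltas:

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.randn(3)                     # inputs x_l
W1 = rng.randn(2, 3)                 # incoming weights w_li for two hidden units
W2 = rng.randn(1, 2)                 # outgoing weights w_ik to a single output
delta_out = np.array([1.0])          # pretend error signal delta_k from above

h = np.tanh(W1.dot(x))               # hidden activations h_i
mask = np.array([0.0, 1.0])          # unit 0 is dropped out for this sample
h_masked = mask * h                  # forward prop just samples the activations

# Naive backprop that only reuses the stored (masked) activations:
fprime = 1.0 - h_masked**2           # tanh' -- equals 1 for the dropped unit!
delta_hidden = W2.T.dot(delta_out) * fprime
dW1_naive = np.outer(delta_hidden, x)

# Backprop that also applies the mask to the deltas:
dW1_masked = np.outer(mask * delta_hidden, x)

print(dW1_naive[0])    # nonzero: incoming weights of the dropped unit would move
print(dW1_masked[0])   # all zeros, as it should be
```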
You are right: whether you need to change backprop depends on which activation function you are using. When you use tanh activation functions with dropout, you need to apply the dropout mask to the gradients. Since my code does not (currently) do this, I would advise against using dropout together with tanh activation functions if you are using my code. If I ever use tanh for something I will probably correct it. If your forward propagation looks like $Y = DropoutMask * X$ (* = Hadamard product), your backprop should be $dE/dX = DropoutMask * dE/dY$.
(Sep 09 '13 at 08:32)
alfa
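To make the rule above concrete, here is a minimal sketch of a standalone dropout layer in numpy that applies the same mask in both directions. The class and names (DropoutLayer, drop_prob) are mine, not from alfa's or gdahl's code:

```python
import numpy as np

class DropoutLayer(object):
    """Toy dropout layer: Y = mask * X forward, dE/dX = mask * dE/dY backward."""

    def __init__(self, drop_prob=0.5, rng=None):
        self.drop_prob = drop_prob
        self.rng = rng if rng is not None else np.random.RandomState(0)

    def forward(self, X):
        # Sample a fresh binary mask for every forward pass (training case/minibatch).
        self.mask = (self.rng.uniform(size=X.shape) >= self.drop_prob).astype(X.dtype)
        return self.mask * X

    def backward(self, dE_dY):
        # Hadamard product with the same mask that was used in the forward pass.
        return self.mask * dE_dY

layer = DropoutLayer(drop_prob=0.5)
Y = layer.forward(np.ones((4, 3)))
dE_dX = layer.backward(np.ones((4, 3)))   # zero exactly where Y was zeroed
```

Because the mask multiplies the deltas directly, the activation function of the layer below never gets a chance to leak gradient through a dropped unit, tanh included.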
Applying the dropout mask during backprop in the implementation counts as "changing backprop."
(Sep 09 '13 at 17:17)
gdahl ♦
I'm not convinced that there is a problem, or that there should be any updates even with tanh(). I think it depends on what precisely we mean by the "dropout" algorithm itself. I think about it by imagining that the weights from hidden to output, w_ik, are temporarily set to zero, and their learning rates similarly set to zero. In this case, backprop proceeds as if the unit were missing all the way through, since w_ik is zero. Thinking this way, you can imagine extensions to dropout where there is a per-gradient-update stochastic modification to w_ik on every step.
Let's get back to the original purpose. Dropout approximates bagging models with different sets of hidden features. It seems that, for this purpose, the change to training should follow directly from the feature mask. What you're describing (setting weights to zero) doesn't sound like Dropout, but rather DropConnect (http://cs.nyu.edu/~wanli/dropc/). Dropout sets hidden units to zero, not just hidden weights. That's equivalent to setting all weights going out of a unit to zero. But when you do that, the weights you just dropped out will still get updated, because the weight gradient doesn't depend on the current weight value.
(Sep 09 '13 at 10:29)
TomU
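A quick numpy check of the distinction above (again my own toy code, not from the DropConnect paper): zeroing a hidden unit's activation automatically zeroes the gradient on its outgoing weights, because dw_ik = delta_k * h_i, whereas merely zeroing the outgoing weights does not, since that gradient never looks at the current weight value; zeroed weights do, however, cut off the error flowing to the layers below:

```python
import numpy as np

rng = np.random.RandomState(0)
h = rng.randn(2)             # hidden activations; pretend unit 0 is "dropped"
W2 = rng.randn(1, 2)         # hidden-to-output weights w_ik
delta_out = np.array([1.0])  # error signal arriving at the output unit

# Dropout view: zero the activation of unit 0.
h_drop = h.copy()
h_drop[0] = 0.0
dW2_dropout = np.outer(delta_out, h_drop)       # column 0 is exactly zero

# "Zero the outgoing weights" view: the activation is untouched.
W2_zeroed = W2.copy()
W2_zeroed[:, 0] = 0.0
dW2_weightzero = np.outer(delta_out, h)         # column 0 is generally nonzero
delta_below = W2_zeroed.T.dot(delta_out)        # but unit 0 receives no error from above

print(dW2_dropout[:, 0], dW2_weightzero[:, 0], delta_below)
```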
Hinton's paper (http://arxiv.org/pdf/1207.0580.pdf) says this:
To me this suggests that a hidden unit which is "omitted" from the network does not induce gradient updates on either its incoming or its outgoing weights for the training example in question. The conceptual assumption, as I see it, is that the network's overall function either contains that unit's term or it doesn't, and both the forward evaluation and the gradients are computed accordingly. Do people disagree? So a zero weight on the output layer is still not the same as 'omitted'. The question remains open as to which method (updating the weights connected to omitted hiddens, or not) performs better empirically.
Guys, help me figure out if I'm thinking straight. Assume I have an RBM network which looks like this, with dropout:

That is, no dropout in layer 2. Before finetuning I scale up the weights accordingly. Backpropagating errors from the output layer to layer 3 is OK: I apply the dropout mask to the gradients in layer 3. But when backpropagating the errors from layer 3, which are now thinned out by half, to layer 2 -- shouldn't W3 be scaled up by 2? Or do I backpropagate without the dropout mask?
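For what it's worth, here is a sketch of the standard training-time scheme as I understand it (plain numpy, made-up layer sizes, not the poster's actual network): the mask sampled in the forward pass is reused on the deltas in the backward pass, the weights are used as-is in both directions, and no extra factor of 2 appears anywhere during backprop; scaling by the retention probability only enters when the net is later run without dropout:

```python
import numpy as np

rng = np.random.RandomState(0)
sizes = [784, 500, 500, 2000, 10]          # hypothetical layer sizes
W = [rng.randn(m, n) * 0.01 for m, n in zip(sizes[1:], sizes[:-1])]
drop = [0.5, 0.0, 0.5, 0.0]                # per-layer drop probabilities (0.0 => all-ones mask)

def forward(x):
    acts, masks = [x], []
    for Wl, p in zip(W, drop):
        h = np.tanh(Wl.dot(acts[-1]))
        m = (rng.uniform(size=h.shape) >= p).astype(h.dtype)
        masks.append(m)
        acts.append(m * h)                 # dropout applied to the activations
    return acts, masks

def backward(acts, masks, delta_top):
    grads = []
    delta = delta_top
    for l in reversed(range(len(W))):
        delta = masks[l] * delta                 # same mask as in the forward pass
        delta = delta * (1.0 - acts[l + 1]**2)   # tanh', from the stored activation
        grads.insert(0, np.outer(delta, acts[l]))
        delta = W[l].T.dot(delta)                # weights used as-is, no extra scaling
    return grads

acts, masks = forward(rng.randn(sizes[0]))
grads = backward(acts, masks, rng.randn(sizes[-1]))   # made-up top-level error signal
```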
That is no dropout in layer 2. Before finetuning I scale up the weights accordingly Backpropagating errors from output layer to layer 3 is ok. I apply dropout mask to the gradients in layer 3. Backpropagating errors from layer 3, which are now thinned out by half, to layer 2 -- shouldn't W3 be scaled up by 2? Or do I backpropagate without the dropout mask? |