Hi there!

I have a small problem with understanding dropout:

All implementations I've looked through thus far (Pylearn2, the one from gdbn.tar.gz on gdahl's homepage, and various others) only change the forward propagation step of the network to accommodate dropout. That is, they sample the inputs/activations and then just go merrily on their way.

I don't understand why the backpropagation step doesn't need to change. A dropped-out unit should have no influence on the learning step, right? And yet the weights going INTO a dropped-out unit might still change.

Here is my thinking:

Let w_ab denote the weight from unit a to unit b, and say unit h_i got dropped out for the current training sample. Clearly none of the outgoing weights w_ik will get updated, since their update is

dw_ik = delta_k * h_i = 0 (where delta_k is the error backpropagated from unit k above, and h_i = 0 because the unit was dropped).

So far so good. However, the connection weights w_li that go into h_i can still change. Their update should be:

dw_li = ( sum_k delta_k * w_ik ) * f'_i * x_l

where f'_i is the derivative of the activation function of h_i, evaluated at its presynaptic input, and x_l is the activation of unit l below. So if f'_i is nonzero, w_li can still change, even though this is a weight the network shouldn't even see (because it feeds into a unit that got dropped out)!

Now, for ReLU units this isn't a problem, because f' = 0 when h_i = 0. If the sigmoid derivative is implemented as h_i*(1-h_i), it also returns 0 for a unit that was zeroed out. But for tanh activations, whose derivative is computed as 1 - h_i*h_i, the derivative evaluates to one when h_i = 0.
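
To make this concrete, here is a small numpy sketch of the situation I mean (one tanh hidden layer, made-up sizes, dropout applied only in the forward pass; this is my own toy example, not code from any of the implementations above):

    import numpy as np

    rng = np.random.default_rng(0)

    x = rng.normal(size=3)          # activations x_l of the layer below
    W = rng.normal(size=(4, 3))     # weights w_li into the hidden layer
    V = rng.normal(size=(2, 4))     # weights w_ik out of the hidden layer
    t = rng.normal(size=2)          # target for a squared-error output

    mask = np.array([1.0, 0.0, 1.0, 1.0])   # hidden unit h_1 is dropped out

    # Forward pass: sample the activations and carry on, as those implementations do.
    a = W @ x                       # presynaptic input of the hidden layer
    h = np.tanh(a) * mask           # the dropped unit outputs exactly 0
    y = V @ h
    delta_y = y - t                 # output error

    # Backward pass left completely unchanged:
    delta_h = (V.T @ delta_y) * (1.0 - np.tanh(a) ** 2)   # f'_i at the presynaptic input
    dW = np.outer(delta_h, x)       # dw_li = delta_i * x_l
    dV = np.outer(delta_y, h)       # dw_ik = delta_k * h_i

    print(dV[:, 1])                 # zeros: outgoing weights of the dropped unit don't move
    print(dW[1])                    # NOT zeros: its incoming weights w_l1 still change

    # Computing f' from the masked activation instead (1 - h*h for tanh) doesn't help,
    # since 1 - 0*0 = 1.  Masking the backpropagated error does:
    dW_masked = np.outer(delta_h * mask, x)
    print(dW_masked[1])             # all zeros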

Is there something I'm missing here, or is the claim that "you don't need to change backprop to implement dropout" based on the assumption that you don't use tanh?

asked Sep 06 '13 at 08:50

TomU


4 Answers:

You are right: whether you need to change backprop depends on what activation function you are using. When you use tanh activation functions with dropout, you need to apply the dropout mask to the gradients. Since my code does not (currently) do this, I would advise against using dropout together with tanh activation functions if you are using my code. If I ever use tanh for something I will probably correct it.

answered Sep 06 '13 at 17:17

gdahl ♦

If your forward propagation looks like $Y = DropoutMask * X$ (* = Hadamard product), your backprop should be $dE/dX = DropoutMask * dE/dY$.
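
In code that rule might look roughly like this (a minimal sketch; the function names are made up, not any particular library's API):

    import numpy as np

    def dropout_forward(X, p, rng):
        # Sample a Bernoulli keep-mask and remember it for the backward pass.
        mask = (rng.random(X.shape) >= p).astype(X.dtype)
        return mask * X, mask

    def dropout_backward(dE_dY, mask):
        # Same mask, applied elementwise: dE/dX = DropoutMask * dE/dY.
        return mask * dE_dY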

(Sep 09 '13 at 08:32) alfa

Adding the dropout mask to backprop in the implementation counts as "changing backprop."

(Sep 09 '13 at 17:17) gdahl ♦

I'm not convinced that there is a problem, or that there should be updates even with tanh().

I think it depends on what precisely we mean by the "dropout" algorithm itself. I think about it by imagining that the weights from hidden to output w_ik are temporarily set to zero, and their learning rate similarly set to zero.

In this case, backprop proceeds as if the unit were missing all the way through, since w_ik is zero.

Thinking this way, you can imagine extensions to dropout where there is a per-gradient-update stochastic modification to w_ik on every step: w_effective_ik = w_ik * RV_i, with a fresh draw of RV_i for each update. Classic dropout uses a Bernoulli random variable RV in {0,1}, but any non-negative distribution might work.
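
A rough numpy sketch of that effective-weight view (the shapes and the second distribution are just for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    W_ik = rng.normal(size=(5, 4))   # rows = hidden units i, columns = output units k

    # Classic dropout: one Bernoulli draw per hidden unit per gradient update,
    # shared across all of that unit's outgoing weights.
    RV = rng.integers(0, 2, size=(5, 1)).astype(float)
    W_effective = W_ik * RV          # rows of "dropped" units are zero for this update

    # Any non-negative multiplicative distribution could play the same role, e.g.:
    RV_alt = rng.gamma(shape=2.0, scale=0.5, size=(5, 1))   # mean 1, always >= 0
    W_effective_alt = W_ik * RV_alt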

Let's get back to the original purpose. Dropout approximates bagging over models with different sets of hidden features. It seems to me that, for this purpose, the modification to training should follow from the modification to the feature mask.

answered Sep 07 '13 at 18:02

Matt

edited Sep 07 '13 at 18:50

What you're describing (setting weights to zero) doesn't sound like Dropout, but rather DropConnect (http://cs.nyu.edu/~wanli/dropc/).

Dropout sets hidden units to zero, not just hidden weights. That's equivalent to setting all the weights going out of a unit to zero. But when you do that, the weights you just dropped out will still get updated, because the weight gradient doesn't depend on the current weight value.
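
The difference in a few lines of numpy (toy sizes and Bernoulli(0.5) masks, purely for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=4)           # activations of the previous layer
    W = rng.normal(size=(3, 4))      # weights into the next layer

    # Dropout: zero whole units, i.e. entries of the activation vector.
    unit_mask = rng.integers(0, 2, size=4).astype(float)
    y_dropout = W @ (unit_mask * x)

    # DropConnect: zero individual weights instead.
    weight_mask = rng.integers(0, 2, size=W.shape).astype(float)
    y_dropconnect = (W * weight_mask) @ x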

(Sep 09 '13 at 10:29) TomU

Hinton's paper (http://arxiv.org/pdf/1207.0580.pdf) says this:

Overfitting can be reduced by using “dropout” to prevent complex co-adaptations on the training data. On each presentation of each training case, each hidden unit is randomly omitted from the network with a probability of 0.5, so a hidden unit cannot rely on other hidden units being present. Another way to view the dropout procedure is as a very efficient way of performing model averaging with neural networks. A good way to reduce the error on the test set is to average the predictions produced by a very large number of different networks. The standard way to do this is to train many separate networks and then to apply each of these networks to the test data, but this is computationally expensive during both training and testing. Random dropout makes it possible to train a huge number of different networks in a reasonable time. There is almost certainly a different network for each presentation of each training case but all of these networks share the same weights for the hidden units that are present.

To me this suggests that a hidden unit which is "omitted" from the network does not induce gradient updates on either its incoming or its outgoing weights for the training example in question. The conceptual assumption, as I see it, is that the network's overall function either contains a term for that unit or it doesn't, and both the forward evaluation and the gradients are computed accordingly. Do people disagree?

So a zero weight on the output layer is still not the same as 'omitted'.

It is still an open question which method (updating the weights connected to omitted hidden units, or not) performs better empirically.

answered Sep 10 '13 at 00:06

Matt

edited Sep 10 '13 at 00:24

Guys, help me figure out if I'm thinking straight.

Assume I have an RBM network which looks like this, with dropout:

h0_input --W1--> h1 --W2--> h2 --W3--> h3 --W4--> h4_output
dropout rates: h0: 0.2, h1: 0.5, h2: 0.0, h3: 0.5

That is, there is no dropout in layer 2 (h2).

Before finetuning I scale up the weights accordingly (see the short sketch after this list):
W1 by 1.25
W2 by 2.0
W3 by 1.0
W4 by 2.0
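
The factors are just 1/(1 - p), using the dropout rate of the layer feeding into each weight matrix (that per-W assignment is my reading of the diagram above):

    # Scale factor 1 / (1 - p) for the dropout rate of each W's input layer.
    rates = {"W1": 0.2, "W2": 0.5, "W3": 0.0, "W4": 0.5}
    scales = {name: 1.0 / (1.0 - p) for name, p in rates.items()}
    print(scales)   # -> W1: 1.25, W2: 2.0, W3: 1.0, W4: 2.0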

Backpropagating errors from the output layer to layer 3 is OK: I apply the dropout mask to the gradients in layer 3. But when backpropagating errors from layer 3, which are now thinned out by half, to layer 2 -- shouldn't W3 be scaled up by 2? Or do I backpropagate without the dropout mask?

answered Jul 14 '14 at 08:44

drgs

edited Jul 14 '14 at 08:58
