
In his thesis, Improving Neural Networks with Dropout, Srivastava argues that the weights of pretrained dropout nets have to be scaled before fine-tuning. He says:

Dropout nets can also be pretrained using these techniques. The procedure is identical to standard pretraining [6] except with a small modification - the weights obtained from pretraining should be scaled up by a factor of 1/p. The reason is similar to that for scaling down the weights by a factor of p when testing (maintaining the same expected output at each unit). Compared to learning from random initializations, finetuning from pretrained weights typically requires a smaller learning rate so that the information in the pretrained weights is not entirely lost.

It is not clear to me why the weights are scaled: are they scaled before testing to maintain the same expected output, or before finetuning (which itself uses dropout) to get the same expected output?
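In code, the two scalings described in the quote look like this (a minimal numpy sketch with assumed variable names; here p denotes the probability of retaining a unit):

```python
import numpy as np

p = 0.5  # probability of retaining a unit during dropout
W_pretrained = np.random.default_rng(1).standard_normal((4, 3))

# Before finetuning WITH dropout: scale the pretrained weights UP by 1/p,
# so the dropout net's expected activations match those of the pretrained net.
W_finetune = W_pretrained / p

# After training with dropout, before testing WITHOUT dropout:
# scale the weights DOWN by p to maintain the same expected output.
W_test = W_finetune * p  # recovers the original scale
```

The two scalings are inverses of each other, which is why they are easy to confuse: one compensates for turning dropout on, the other for turning it off.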

asked Nov 01 '13 at 10:27


gerard


2 Answers:

Dropout turns off, say, 50% of the units during training, but when you actually use the NN, you turn all of the units back on. Therefore, the average activity will double. To compensate, you can divide the weights by 2.
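This compensation can be checked numerically; a sketch with assumed names (p is the probability of keeping a unit):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5  # probability of keeping a unit
W = rng.standard_normal((4, 3))  # toy weight matrix
x = rng.standard_normal(4)       # toy input

# Training: each input unit is kept with probability p (here, half are dropped).
mask = rng.random(4) < p
train_out = (x * mask) @ W

# Testing: all units are on, so activations are on average 1/p times larger.
# Compensate by scaling the weights down by p (dividing by 2 when p = 0.5).
test_out = x @ (W * p)
```

Averaged over many random masks, the training-time output matches `test_out`, which is exactly the "same expected output at each unit" argument from the thesis.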

answered Nov 01 '13 at 11:37


Max

Thanks Max. For testing I have no doubt: scale the weights by p, which is usually 50%. But I use pretraining and want to finetune the network. Must I scale the weights by 1/p (the inverse) before fine-tuning, as Srivastava suggests (if I understood him right)? And if so, why?

(Nov 01 '13 at 12:08) gerard

Yes, for the reason I just stated above. You want the NN to behave as if you are still doing dropout, but only on average.

(Nov 01 '13 at 12:20) Max

OK, thanks Max. I think I got it now. I will scale my network by p after pretraining in any case, no matter what I am doing (finetuning or testing).

But when the network is pretrained without dropout, I read Srivastava as recommending scaling it by 1/p for finetuning with dropout.

(Nov 01 '13 at 16:10) gerard

I actually prefer to scale the weights in the other direction, during training, when dropout is used, so that inference at test time is unmodified regardless of whether dropout was used during training. The same practice is adopted in Dropout Training as Adaptive Regularization by Wager et al. Of course, both approaches are entirely equivalent; it's a matter of preference, I suppose. I think the code is cleaner this way :)

(Nov 01 '13 at 17:57) Sander Dieleman
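Sander's variant is often called "inverted dropout": the scaling happens at training time, so test-time inference needs no modification. A sketch under the same assumptions as above:

```python
import numpy as np

rng = np.random.default_rng(2)
p = 0.5  # probability of keeping a unit
W = rng.standard_normal((4, 3))
x = rng.standard_normal(4)

# Training: keep units with probability p and divide by p immediately,
# so the expected activation already matches the full (no-dropout) network.
mask = (rng.random(4) < p) / p
train_out = (x * mask) @ W

# Testing: use the weights unchanged -- no scaling step to remember.
test_out = x @ W
```

Averaged over many random masks, the training-time output again matches `test_out`, but here the saved weights can be used as-is at test time.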

Will it work to just halve the learning rate while fine-tuning with dropout?

answered Dec 29 '13 at 16:15


drgs

No, because that would affect the bias terms as well, and only the weights should be scaled.

(May 13 '14 at 06:35) elarosca

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.