Srivastava argues in his thesis Improving Neural Networks with Dropout that the weights of pre-trained dropout nets have to be scaled before fine-tuning.

For me it is not clear which way to scale the weights: scale them before testing, to maintain the same expected output, or scale them before fine-tuning (which itself uses dropout), to get the same expected output.
Dropout turns off, say, 50% of the units during training, but when you actually use the NN, you turn all of the units back on. Therefore, the average activity will double. To compensate, you can divide the weights by 2.

Thanks Max. For testing I have no doubt about scaling the weights by p, which is usually 50%. But I use pretraining and I want to fine-tune the network. Must I scale the network by 1/p (the inverse), as Srivastava suggests (if I understood him right), before fine-tuning? And if so, why?
(Nov 01 '13 at 12:08)
gerard
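A minimal numpy sketch of the convention Max describes, with made-up layer sizes and variable names (p here is the retention probability, i.e. the chance a unit is kept): during training a random mask turns units off, and at test time all units stay on but the weights are multiplied by p so the expected pre-activation stays the same.

```python
import numpy as np

rng = np.random.default_rng(0)

p = 0.5                                # retention probability (units kept with prob p)
W = rng.standard_normal((256, 128))    # hypothetical weight matrix
x = rng.standard_normal(256)           # hypothetical input activations

# Training: drop each unit of x independently with probability 1 - p.
mask = rng.random(x.shape) < p
train_pre_activation = (x * mask) @ W

# Testing: all units are on, so multiply the weights by p (divide by 2 for p = 0.5)
# to keep the expected pre-activation the same as during training.
test_pre_activation = x @ (W * p)
```

Averaged over many random masks, the training-time pre-activation equals the test-time one.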
Yes, for the reason I just stated above. You want the NN to behave as if you are still doing dropout, but only on average.
(Nov 01 '13 at 12:20)
Max
Ok, thanks Max. I think I got it now. I will scale my network by p after pretraining anyway, no matter what I am doing (fine-tuning or testing). But when the network is pretrained without dropout, I read Srivastava as recommending to scale it by 1/p for fine-tuning with dropout.
(Nov 01 '13 at 16:10)
gerard
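If I follow the thread correctly, the 1/p scaling only comes into play when the pretrained weights were learned without dropout: multiplying them by 1/p before fine-tuning makes the expected pre-activation under dropout match what the pretrained net produced. A rough sketch under that assumption (shapes and names are made up):

```python
import numpy as np

rng = np.random.default_rng(1)

p = 0.5
W_pretrained = rng.standard_normal((256, 128))  # pretrained WITHOUT dropout
x = rng.standard_normal(256)

# Scale up by 1/p before fine-tuning with dropout, so that the expected
# pre-activation under dropout matches the pretrained (no-dropout) net.
W_finetune = W_pretrained / p
mask = rng.random(x.shape) < p
finetune_pre_activation = (x * mask) @ W_finetune  # expectation equals x @ W_pretrained

# After fine-tuning, scale back down by p for testing with all units on.
W_test = W_finetune * p
```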
I actually prefer to scale the weights in the other direction when dropout is used during training, so that inference at test time is unmodified, regardless of whether dropout was used during training. The same practice is adopted in Dropout Training as Adaptive Regularization by Wager et al. Of course both approaches are entirely equivalent; it's a matter of preference, I suppose. I think the code is cleaner this way :)
(Nov 01 '13 at 17:57)
Sander Dieleman
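Sander's convention is what is often called "inverted" dropout: the rescaling is folded into the training-time mask by dividing the surviving activations by p, so the test-time forward pass uses the weights unchanged. A minimal sketch, again with made-up shapes:

```python
import numpy as np

rng = np.random.default_rng(2)

p = 0.5
W = rng.standard_normal((256, 128))
x = rng.standard_normal(256)

# Training: drop units and divide the survivors by p ("inverted" dropout),
# so the expected pre-activation already matches the all-units-on case.
mask = (rng.random(x.shape) < p) / p
train_pre_activation = (x * mask) @ W

# Testing: nothing to rescale; use the weights exactly as trained.
test_pre_activation = x @ W
```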