I have been spending some time trying to get dropout to work. My first step is MNIST, and I am not able to reproduce the results. So far, the best I have gotten is about 250 wrongly labeled examples on the test set, compared to their ~115 with the same architecture: an MLP with two hidden layers of 800 units each.
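For reference, this is roughly the kind of network I have in mind. It is only a sketch: the keep probabilities (0.8 on the inputs, 0.5 on the hidden units), the small Gaussian initialisation, and the scaling of activations at test time are my reading of the paper, not something I am sure about.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes: 784 -> 800 -> 800 -> 10, as in the paper's MNIST MLP.
sizes = [784, 800, 800, 10]
# Small random init; the exact initialisation is one of the unclear details.
W = [rng.normal(0, 0.01, (n_in, n_out)) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
b = [np.zeros(n_out) for n_out in sizes[1:]]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, train=True, p_input=0.8, p_hidden=0.5):
    """Forward pass with dropout. p_* is the probability of KEEPING a unit
    (0.8 / 0.5 are my assumptions). At test time the activations are scaled
    by the keep probability, which is equivalent to scaling the outgoing weights.
    """
    h = x * (rng.random(x.shape) < p_input) if train else x * p_input
    for i, (Wi, bi) in enumerate(zip(W, b)):
        a = h @ Wi + bi
        if i < len(W) - 1:
            # Hidden layers: sigmoid is my guess (see question 2), then dropout.
            h = sigmoid(a)
            h = h * (rng.random(h.shape) < p_hidden) if train else h * p_hidden
        else:
            # Output layer: softmax over the 10 classes.
            a = a - a.max(axis=-1, keepdims=True)
            e = np.exp(a)
            h = e / e.sum(axis=-1, keepdims=True)
    return h
```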
Looking at the appendix, there are some details on training which are not clear, at least to me. I would also be interested in practical experiences, in case you have any. If you have successfully reproduced the results or know the authors, maybe you can help me.
- What is an epoch in their case? To my understanding, an epoch is one pass over the training set. My feeling, however, is that they mean one weight update instead, because the formulas suggest so: t is incremented after each weight update, and the learning rate and momentum schedules are indexed by t as well.
- Which transfer function is being used? My guess would be the sigmoid, since that is what you get out of an RBM-pretrained MLP, for which they also give results.
- How sensitive is the algorithm to all the tricks? By tricks I mean (a) weight constraints and a large initial learning rate, (b) the "unconventional" entangling of learning rate and momentum term, (c) dropout itself, (d) the exact values of the hyperparameters in the learning rate and momentum schedules, and (e) the exact value of the maximum squared weight length. (My current reading of these is sketched below.)
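This is how I currently interpret the update rule and schedules from the appendix. All the constants (initial learning rate 10.0, decay 0.998, momentum ramping from 0.5 to 0.99 over 500 steps, maximum squared weight length 15) are my guesses and exactly what questions (d) and (e) are about, so please correct me if I misread something.

```python
import numpy as np

def sgd_step(w, dw_prev, grad, t,
             eps0=10.0, decay=0.998,           # initial learning rate and decay factor (my guess)
             p_init=0.5, p_final=0.99, T=500,  # momentum schedule constants (my guess)
             max_sq_len=15.0):                 # max squared length of incoming weights (my guess)
    """One weight update the way I currently read the appendix.

    t counts weight *updates* (question 1): both schedules are indexed by it directly.
    """
    # Momentum ramps linearly from p_init to p_final over the first T steps.
    p_t = p_final if t >= T else p_init + (p_final - p_init) * t / T
    # Learning rate decays exponentially with t.
    eps_t = eps0 * decay ** t
    # The "entangled" update (trick b): the gradient term is scaled by (1 - p_t).
    dw = p_t * dw_prev - (1.0 - p_t) * eps_t * grad
    w = w + dw
    # Max-norm constraint (tricks a/e): rescale each hidden unit's incoming
    # weight vector (a column of w) if its squared length exceeds the bound.
    sq_len = (w ** 2).sum(axis=0, keepdims=True)
    scale = np.minimum(1.0, np.sqrt(max_sq_len / np.maximum(sq_len, 1e-12)))
    return w * scale, dw
```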
Update
4. Is the official split used? Or is the validation set part of the training set (since no early stopping is performed)?
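To make question 4 concrete, these are the two options I see. The loader functions are placeholders; only the index arithmetic matters.

```python
X, y = load_mnist_train()           # hypothetical loader: 60000 x 784 images, 60000 labels
X_test, y_test = load_mnist_test()  # official 10000-example test set

use_official_split = True
if use_official_split:
    # Train on all 60000 examples, evaluate on the official test set only.
    X_train, y_train = X, y
else:
    # Hold out the last 10000 training examples as a validation set
    # (only makes sense if some form of model selection / early stopping is done).
    X_train, y_train = X[:50000], y[:50000]
    X_val, y_val = X[50000:], y[50000:]
```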
asked Sep 30 '12 at 13:42 by Justin Bayer