
I have been spending some time trying to get dropout to work. My first step is MNIST, and I am not able to reproduce the results. So far the best I have managed is about 250 wrongly labeled examples on the test set, compared to their ~115 with the same architecture, an MLP with two hidden layers of 800 units each.

Looking at the appendix, some of the training details are not clear, at least to me. I would also be interested in practical experience, in case you have any. If you have successfully reproduced the results or know the authors, maybe you can help me.

  1. What is an epoch in their case? To my understanding, an epoch is one pass over the training set. My feeling, however, is that they mean one weight update instead, because the formulas indicate so: t is incremented after each weight update, and the learning rate schedules are indexed by t as well.
  2. Which transfer function is being used? My guess would be the logistic sigmoid, since that is what you get out of an RBM-pretrained MLP, for which they also report results.
  3. How sensitive is the algorithm to all the tricks? By tricks I mean (a) weight constraints combined with a large initial learning rate, (b) the "unconventional" entangling of step rate and momentum term, (c) dropout itself, (d) the exact values of the hyperparameters of the learning rate and momentum schedules, and (e) the exact value of the maximum squared weight length. (A sketch of how I currently implement (a), (c) and (e) follows this list.)

Update: 4. Is the official split used, or is the validation set part of the training set? (Since no early stopping is performed.)
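
To make question 3 concrete, here is a minimal numpy sketch of what I am currently doing for the dropout layer and the weight constraint. The logistic sigmoid, the dropout rate, and the value 15.0 for the maximum squared weight length are my own guesses, not taken from the paper:

    import numpy as np

    def dropout_hidden(X, W, b, p_drop=0.5, rng=np.random):
        """Hidden-layer activations with units dropped at rate p_drop during training."""
        h = 1.0 / (1.0 + np.exp(-(X @ W + b)))  # logistic sigmoid -- my guess, see question 2
        mask = rng.binomial(1, 1.0 - p_drop, size=h.shape)
        return h * mask

    def momentum_step_max_norm(W, v, grad, lr, momentum, max_sq_len=15.0):
        """One momentum update, then rescale any hidden unit whose incoming
        weight vector exceeds the maximum squared length ((a) and (e) above)."""
        v = momentum * v - lr * grad
        W = W + v
        sq_len = (W ** 2).sum(axis=0)  # squared length per hidden unit (columns of W)
        scale = np.sqrt(np.minimum(1.0, max_sq_len / np.maximum(sq_len, 1e-12)))
        return W * scale, v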

asked Sep 30 '12 at 13:42


Justin Bayer

edited Sep 30 '12 at 14:28


One Answer:
  1. I expect an epoch is a full pass over the training set.
  2. I think logistic sigmoid is correct.
  3. I have gotten dropout to work without weight constraints or anything special. I would argue that the momentum formula they use disentangles step size and momentum more than the traditional formula does; either way, you can convert a momentum and learning rate from one convention to the other (see the sketch below). I use the more traditional formula. I haven't used dropout on the datasets they try, so I don't know how large a region of hyperparameter space produces good results; in general, always optimize your hyperparameters. If you want to keep things simple, only try dropout rates for each layer in {0, 0.25, 0.5}. Since dropout really only helps prevent overfitting, you should use a high-capacity neural net with plenty of hidden units.
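
For reference, here is a rough sketch of the two momentum conventions and how to map a (learning rate, momentum) pair from one to the other. The "paper" form below is how I read their appendix, so treat it as an assumption rather than their exact formula:

    def step_traditional(theta, v, grad, lr, momentum):
        # v <- momentum * v - lr * grad;  theta <- theta + v
        v = momentum * v - lr * grad
        return theta + v, v

    def step_paper(theta, v, grad, lr, momentum):
        # v <- momentum * v - (1 - momentum) * lr * grad;  theta <- theta + v
        v = momentum * v - (1.0 - momentum) * lr * grad
        return theta + v, v

    # The two steps coincide when lr_traditional = (1 - momentum) * lr_paper,
    # so a schedule stated in one convention can be rewritten in the other.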

answered Sep 30 '12 at 16:52


gdahl ♦
