I wanted to get something cleared up regarding the training of neural networks with dropout using minibatch gradient descent. The original paper by Hinton et al. says:
This implies that a dropout mask is sampled for every training example. Since an update is computed based on a minibatch of K training examples, that means K different masks are sampled for this update. In a more recent paper by Goodfellow et al. on Maxout networks:
Based on this, it would make more sense to sample only a single dropout mask for the given minibatch, and use the same one for all examples in the minibatch (since this update is then effectively operating on a single model, and not K different models). But on the next page of this paper:
So then it sounds like K different dropout masks are used, and thus K different models are updated.

I guess I'm just a bit confused by the wording and terminology, so I'd like to know which method is 'correct' (i.e. how it's usually done): do I sample a new dropout mask per example or per update? Right now I'm leaning towards the former, which would probably be easier to implement as well. I imagine both approaches will work (I haven't tried it yet), but maybe one works significantly better than the other.

Just to clarify: I know that I need to sample a fresh dropout mask every time the same training example is reused (i.e. keeping it constant throughout training is incorrect).
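To make the two options concrete, here is a minimal NumPy sketch of what I mean; the layer width, minibatch size, keep probability, and variable names are just illustrative, not taken from either paper.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n_hidden = 32, 128        # minibatch size and hidden-layer width (arbitrary)
p_keep = 0.5                 # probability of keeping a unit
h = rng.standard_normal((K, n_hidden))   # hidden activations for one minibatch

# Option 1: one mask per training example -> K different masks per update.
mask_per_example = rng.random((K, n_hidden)) < p_keep
h_dropped_1 = h * mask_per_example / p_keep        # inverted-dropout scaling

# Option 2: a single mask shared by the whole minibatch, broadcast over all K rows.
mask_per_batch = rng.random((1, n_hidden)) < p_keep
h_dropped_2 = h * mask_per_batch / p_keep
```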
Sampling a new mask per example is how it is usually done. The intuition for why it should work better is the same as the intuition for why, in SGD, it makes sense to draw a fresh minibatch of data instead of doing a second update on the same minibatch.
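For concreteness, here is a rough sketch of one minibatch update with per-example masks; the network sizes, learning rate, and squared-error loss are made up for illustration and are not from either paper. Each row of `mask` thins the network differently, but all K examples contribute to a single update of the same shared weights.

```python
import numpy as np

rng = np.random.default_rng(1)
K, n_in, n_hidden, n_out = 32, 64, 128, 10   # illustrative sizes
p_keep, lr = 0.5, 0.1

W1 = rng.standard_normal((n_in, n_hidden)) * 0.01
W2 = rng.standard_normal((n_hidden, n_out)) * 0.01
x = rng.standard_normal((K, n_in))
y = rng.standard_normal((K, n_out))          # dummy regression targets

# Forward pass: one dropout mask row per example.
h = np.maximum(0.0, x @ W1)                  # ReLU hidden layer
mask = (rng.random((K, n_hidden)) < p_keep) / p_keep
h_drop = h * mask
y_hat = h_drop @ W2

# Backward pass (mean squared error); the mask reappears, so each example's
# gradient flows only through the units that were kept for that example.
d_y = (y_hat - y) / K
dW2 = h_drop.T @ d_y
d_h = (d_y @ W2.T) * mask * (h > 0)
dW1 = x.T @ d_h

# A single SGD step on the shared weights.
W1 -= lr * dW1
W2 -= lr * dW2
```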