I currently use Nesterov's accelerated gradient (NAG) to speed up learning MNIST on a neural net:

avg_grad(t) = grad(momentum * v(t) + w(t)) / batch_size
v(t+1) = momentum * v(t) - learning_rate * avg_grad(t)
w(t+1) = w(t) + v(t+1)

where t denotes time, v is the momentum (velocity) vector, and w are the weights. I use 800 hidden units with logistic activations and a softmax + cross-entropy output; dropout [0.2, 0.5]; learning rate 0.1; momentum starting at 0.5 and increased by 0.01 per epoch; batch size 100 with stochastic gradient descent; and no weight decay.
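For concreteness, here is a minimal sketch of that update step; `grad` stands for a hypothetical function returning the summed gradient of the loss over the current minibatch, so the names and signature are illustrative only, not my actual code:

```python
def nag_step(w, v, grad, momentum, lr, batch_size):
    """One Nesterov update as written above: evaluate the gradient at the
    look-ahead point w + momentum*v, average it over the minibatch, then
    update the velocity and the weights."""
    avg_grad = grad(w + momentum * v) / batch_size  # grad() sums over the minibatch (assumption)
    v_new = momentum * v - lr * avg_grad
    w_new = w + v_new
    return w_new, v_new
```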
Training is faster when I use NAG, but I end up in worse local minima. Why is this so? According to the literature:
It could be that additional regularization is needed to make momentum work, or that I need to run the network for more epochs (the dropout paper used 3000 epochs). However, why do I reach better local minima, and reach them faster, without NAG, when this should not be the case?
For the sake of it, I just did some experiments. I ran with the momentum schedule from the paper, a final momentum of 0.99, and a step rate of 0.1. Parameters were initialized by drawing from N(0, 0.1). 150 epochs with minibatches of size 250. I picked the net with the best validation score and report test scores. Logistic activations, softmax + cross entropy. I did not perform a fixed number of epochs with momentum 0.9, as recommended in the paper. I got 175 errors. I thus cannot confirm your findings on NAG, but maybe the momentum schedule is the point.

Thanks for running the experiment. I just found a bug in the bias gradients and now get 170 errors. What one can learn from this is that a gradual NAG update schedule works well, even in shallow architectures (the momentum update schedule in the paper might yield better results, of course). So NAG works well, even near local minima. While Max hinted at bugs in my code, I think your falsification of my 'weird momentum behavior near local minima' hypothesis was most useful, so I mark your post as the answer.
(Oct 18 '13 at 10:49)
Tim Dettmers
Things might fail to work for any number of reasons, such as bad luck with hyperparameters or bugs, but I noticed one thing that seems clearly wrong: you start with momentum 0.5 and increase it by 0.01 per epoch. This means that your momentum reaches 1.0 at epoch 50 and keeps growing, at which point the updates diverge.

Sorry for being unclear: I cap the momentum at 0.99. But there may be bugs present, as Justin could not replicate this behavior. I am still checking whether everything is in order with the code and will report back.
(Oct 17 '13 at 16:07)
Tim Dettmers
Things to keep in mind:
Why don't you use the momentum schedule from the paper?
It's not clear to me what your stopping criterion is. Are you using performance on a held-out set? If not, it might be that you are overfitting or underfitting.
I assume that the momentum schedule I use here should work in a similar way and should not hamper NAG too much. A steady increase of the momentum, capped at something like 0.99, is quite standard.
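As a concrete illustration of that capped schedule (the numbers are the ones from this thread, not from the paper):

```python
def capped_momentum(epoch, start=0.5, step=0.01, cap=0.99):
    """Capped linear momentum schedule: start at 0.5, add 0.01 per epoch,
    and never let it reach 1 (it saturates at 0.99 from epoch 49 on)."""
    return min(start + step * epoch, cap)

# epoch 0 -> 0.5, epoch 20 -> 0.7, epoch 49 and later -> 0.99
```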
Sorry for being unclear about the stopping criterion: I use early stopping.
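For completeness, a minimal sketch of the early stopping I mean, assuming hypothetical `train_one_epoch` and `validation_error` helpers and a model with a `copy_parameters()` method (none of these names are from my actual code):

```python
def train_with_early_stopping(model, train_one_epoch, validation_error,
                              max_epochs=150, patience=10):
    """Stop once the held-out validation error has not improved for
    `patience` consecutive epochs; return the best parameters seen."""
    best_err = float("inf")
    best_params = None
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(model)                     # hypothetical: one pass over the training set
        err = validation_error(model)              # hypothetical: error on the held-out set
        if err < best_err:
            best_err = err
            best_params = model.copy_parameters()  # hypothetical: snapshot of the weights
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_params, best_err
```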