I currently use Nesterov's accelerated gradient to speed up learning MNIST on a neural net:

avg_grad(t) = grad( w(t) + momentum * v(t) ) / batch_size

v(t+1) = momentum * v(t) - learning_rate * avg_grad(t)

w(t+1) = w(t) + v(t+1),

where t denotes time, v is the velocity (momentum) vector, and w are the weights. The gradient is evaluated at the look-ahead point w(t) + momentum * v(t), which is what distinguishes NAG from classical momentum.
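For concreteness, here is a minimal NumPy sketch of one such update. The function and parameter names are my own, and `grad_at` is a hypothetical stand-in for a backprop pass that returns the summed minibatch gradient:

```python
import numpy as np

def nag_step(w, v, grad_at, momentum, learning_rate, batch_size):
    # Nesterov accelerated gradient: evaluate the gradient at the
    # look-ahead point w + momentum * v, then update velocity and weights.
    avg_grad = grad_at(w + momentum * v) / batch_size
    v_next = momentum * v - learning_rate * avg_grad
    w_next = w + v_next
    return w_next, v_next

# Toy usage on the quadratic loss 0.5 * ||w||^2, whose gradient is w itself:
w, v = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(10):
    w, v = nag_step(w, v, grad_at=lambda p: p, momentum=0.9,
                    learning_rate=0.1, batch_size=1)
```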

I use 800 hidden units with logistic activations, a softmax output, and the cross-entropy loss; dropout of [0.2, 0.5]; a learning rate of 0.1; momentum that starts at 0.5 and increases by 0.01 per epoch; and stochastic gradient descent with a batch size of 100. I use no weight decay.
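In code, the schedule and the rest of the hyper-parameters look roughly like this (a sketch; reading the dropout rates [0.2, 0.5] as 0.2 on the inputs and 0.5 on the hidden units is my interpretation, and the 0.99 cap on the momentum is the one I mention in a comment below):

```python
config = {
    "hidden_units": 800,      # logistic hidden layer, softmax output
    "dropout": (0.2, 0.5),    # assumed: 0.2 on inputs, 0.5 on hidden units
    "learning_rate": 0.1,
    "batch_size": 100,
    "weight_decay": 0.0,
    "epochs": 500,
}

def momentum_at(epoch, start=0.5, step=0.01, cap=0.99):
    # Linear ramp of the momentum, capped below 1.
    return min(start + step * epoch, cap)
```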

  1. Now, when I train my net for 500 epochs without NAG, I get a cross-validation error (on a held-out 20% of the data) of about 1.7%.

  2. When I train my net for 500 epochs with NAG, I get a cross-validation error of about 2.1%.

  3. When I train my net for 400 epochs with NAG and 100 epochs without, I still get a cross-validation error of about 2.1%.

Training is faster when I use NAG, but I end up in worse local minima. Why is this so?

According to the literature:

  1. With the same number of epochs, momentum should in general yield better results, i.e. better local minima (dropout paper, momentum paper).
  2. Setting the momentum to zero once one is close to a local minimum should improve results (momentum paper); a small sketch of such a schedule follows this list.
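As a rough sketch of point 2 (the final momentum value and the length of the final phase below are arbitrary choices for illustration, not values from the paper):

```python
def scheduled_momentum(epoch, total_epochs, start=0.5, step=0.01,
                       cap=0.99, final_epochs=50, final_momentum=0.0):
    # Ramp up and cap the momentum as before, then switch it off for the
    # last `final_epochs` epochs, once training is presumably near a minimum.
    # The 50-epoch window and the final value of 0.0 are illustrative guesses.
    if epoch >= total_epochs - final_epochs:
        return final_momentum
    return min(start + step * epoch, cap)
```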

It could be that additional regularization is needed to make momentum work, or that I need to run the network for more epochs (the dropout paper used 3000 epochs). However, why do I reach better local minima, and reach them faster, without NAG, when it should not be so?

asked Oct 13 '13 at 06:43


Tim Dettmers

Why don't you use the momentum schedule from the paper?

(Oct 15 '13 at 09:16) Justin Bayer

It's not clear to me what your stopping criterion is. Are you using performance on a held-out set? If not, it might be that you're overfitting or underfitting.

(Oct 15 '13 at 14:59) Alexandre Passos ♦

I assume that the momentum schedule I use here should work in a similar way and should not hamper NAG too much. A steady increase of the momentum, capped at something like 0.99, is quite standard.

Sorry for being unclear about the stopping criterion: I use early stopping.

(Oct 17 '13 at 16:06) Tim Dettmers

2 Answers:

Things to keep in mind:

  • The momentum paper is concerned with optimization. That means the whole overfitting issue is willfully ignored.
  • The paper also addresses deep architectures, not shallow ones, such as a net with one hidden layer.

For the sake of it, I just did some experiments. I ran with the momentum schedule from the paper, a final momentum of 0.99, and a step rate of 0.1. Parameters were initialized by drawing from N(0, 0.1). I trained for 150 epochs with minibatches of size 250, picked the net with the best validation score, and report test errors. Logistic activations, softmax + cross entropy. I did not run a fixed number of epochs with momentum 0.9 at the end, as recommended in the paper. Got 175 errors.
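For reference, the schedule from the paper has roughly the following form (writing it from memory, so treat it as a sketch rather than a verbatim copy of the paper):

```python
import math

def paper_momentum(t, mu_max=0.99):
    # Roughly: mu_t = min(1 - 2^(-1 - log2(floor(t / 250) + 1)), mu_max),
    # equivalently 1 - 1 / (2 * (floor(t / 250) + 1)); it starts at 0.5 and
    # approaches 1 from below, capped at mu_max.
    return min(1.0 - 2.0 ** (-1.0 - math.log2(t // 250 + 1)), mu_max)
```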

I thus cannot confirm your findings on NAG, but maybe the momentum schedule is the point.

answered Oct 15 '13 at 16:56


Justin Bayer

Thanks for running the experiment. I just found a bug in the bias gradients and now get 170 errors. What one can learn from this is that a gradual momentum schedule with NAG works well, even in shallow architectures (the momentum update schedule in the paper might yield better results, of course). So NAG works well, even near local minima.

While Max hinted at bugs in my code, I think that your falsification of my 'weird momentum behavior near local minima' hypothesis was most useful, and so I mark your post as the answer.

(Oct 18 '13 at 10:49) Tim Dettmers

Things might fail to work for any number of reasons, such as bad luck with hyperparameters or bugs, but I noticed one thing that seems clearly wrong:

I start with momentum 0.5 and increase by 0.01 per epoch ... I train my net for 500 epochs

This means that your momentum goes from 0.5 to 5.5. However, it's supposed to be strictly less than 1.

answered Oct 13 '13 at 07:09


Max

edited Oct 13 '13 at 07:13

Sorry for being unclear: I cap the momentum at 0.99. But there may be bugs present, as Justin could not replicate this behavior. I will check whether everything in the code is in order and report back.

(Oct 17 '13 at 16:07) Tim Dettmers