
In one of his lectures, G. Hinton said (paraphrased) that reducing the learning rate tends to bring short-term benefits but may lead to worse results in the long term, so it makes sense to reduce the learning rate when training is almost finished.

I observed this in my experiments with deep models as well.

Is there a mathematical justification for this? Is the underlying cause the noise in the mini-batch estimate of the gradient, or, alternatively, the tendency of first-order methods to oscillate?

Edit: In their recent paper "No more pesky learning rates", Schaul et al. propose a mathematical model of SGD, but I don't think the short-term vs. long-term optimality decision comes out of it naturally.

asked Jan 27 '14 at 18:19 by Max
edited Jan 28 '14 at 13:22


One Answer:

They call it misadjustment, something we learned back in the first days of the LMS algorithm. If your learning rate is too big you will keep bouncing around the minimum, but if your learning rate is too small it will take a long time to even get close to it.
I think you can visualize what I'm talking about, right? I couldn't find a good figure or paper to show you right now, but you can always try to find Widrow's Adaptive Filters book for a proper mathematical definition of misadjustment. In that book they even provide a formula for the LMS case as a function of the learning rate.
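To make this concrete, here is a small self-contained sketch (my own, not Widrow's formula): SGD on a one-dimensional quadratic with noisy gradients. With a constant learning rate the iterate keeps bouncing around the minimum at a level roughly proportional to the rate; annealing the rate near the end lets it settle much closer. All constants below are arbitrary choices for the demo.

```python
import random

random.seed(0)

def noisy_grad(w, noise_std=1.0):
    # True gradient of f(w) = 0.5 * w**2 is w; the added noise mimics
    # the variance of a mini-batch gradient estimate.
    return w + random.gauss(0.0, noise_std)

def run_sgd(lr_schedule, steps=6000, w0=5.0, tail=500):
    # Return the average loss 0.5 * w**2 over the last `tail` steps,
    # i.e. the "noise floor" the iterate ends up bouncing around.
    w = w0
    tail_loss = []
    for t in range(steps):
        w -= lr_schedule(t) * noisy_grad(w)
        if t >= steps - tail:
            tail_loss.append(0.5 * w * w)
    return sum(tail_loss) / len(tail_loss)

constant = run_sgd(lambda t: 0.1)                          # never annealed
annealed = run_sgd(lambda t: 0.1 if t < 3000 else 0.001)   # reduced near the end

print("average final loss, constant rate:", constant)
print("average final loss, annealed rate:", annealed)
```

The annealed run ends up with a much smaller average loss, which is the short-term vs. long-term effect in miniature: the large rate makes fast initial progress, but only the reduced rate lets the iterate settle near the minimum.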

answered Jan 27 '14 at 23:02 by eder
edited Jan 27 '14 at 23:03

I'm not asking why there is an optimal learning rate, but why the optimal choice depends strongly on how much training time remains.

(Jan 28 '14 at 13:24) Max

I didn't answer about the optimal choice... I answered about the time that remains! Close to the minimum your gradient steps are so "big" that you can't reach the bottom of the cost surface. Think of it like this: you're close to the bottom of a locally convex function, close to the local minimum, and then you step in the direction of the minimum. If your step is too big, you end up on the other side of the function, as far from the minimum as you were before the step. Then you take a smaller step, so you don't pass the minimum. That's the intuition for annealing the learning rate. It's hard for me to show this without a figure, I'm sorry...

(Jan 28 '14 at 14:29) eder
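A toy numerical version of the picture eder describes (my own sketch, with arbitrary step sizes): plain gradient descent on f(w) = 0.5 * w**2, whose gradient is w. With a step size of 2.0 every update lands exactly on the other side of the minimum at the same distance, so no progress is ever made; a smaller step actually approaches the minimum.

```python
def gd(lr, w0=4.0, steps=6):
    # Plain gradient descent on f(w) = 0.5 * w**2; the gradient is w.
    w = w0
    path = [w]
    for _ in range(steps):
        w -= lr * w
        path.append(w)
    return path

print("lr = 2.0:", gd(2.0))   # bounces between +4 and -4, no progress
print("lr = 0.5:", gd(0.5))   # halves the distance to the minimum each step
```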