In one of his lectures, G. Hinton said (paraphrased) that reducing the learning rate tends to bring short-term benefits but may lead to worse results in the long term, so it makes sense to reduce the learning rate when training is almost finished. I have observed this in my experiments with deep models as well. Is there a mathematical justification for this? Is the underlying cause the noise in the mini-batch estimate of the gradient, or alternatively, the tendency of first-order methods to oscillate?

Edit: In their recent paper "No more pesky learning rates", Schaul et al. propose a mathematical model of SGD, but I don't think the short-term vs. long-term optimality decision comes out of it naturally.
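For concreteness, here is one common back-of-the-envelope version of the noise argument on a one-dimensional quadratic (illustrative notation of mine, not taken from the Schaul et al. paper). Take $f(w) = \frac{\lambda}{2} w^2$ and a stochastic gradient $\hat g_t = \lambda w_t + \xi_t$ with $\mathrm{E}[\xi_t] = 0$ and $\mathrm{Var}[\xi_t] = \sigma^2$. Constant-rate SGD then gives

$$ w_{t+1} = w_t - \eta\, \hat g_t = (1 - \eta\lambda)\, w_t - \eta\, \xi_t, $$

so the mean error decays as $\mathrm{E}[w_t] = (1 - \eta\lambda)^t w_0$, faster for larger $\eta$, while the variance obeys $V_{t+1} = (1 - \eta\lambda)^2 V_t + \eta^2 \sigma^2$ and settles at

$$ V_\infty = \frac{\eta\, \sigma^2}{\lambda\,(2 - \eta\lambda)} \approx \frac{\eta\, \sigma^2}{2\lambda} \quad \text{for small } \eta. $$

A large $\eta$ therefore buys fast progress early on but a higher noise floor later, so shrinking $\eta$ only pays off once the transient term stops dominating, i.e. near the end of training.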
They call it Misadjustment, something we learned back in the first days of the LMS algorithm. If your learning rate is too big you will keep bouncing around the minimum, but if your learning rate is too small it will take a long time to even get close to it.

I'm not asking why there is an optimal learning rate, but why the optimal choice depends strongly on how much training time remains.
(Jan 28 '14 at 13:24)
Max
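To make the horizon dependence concrete, here is a small illustrative simulation (my own toy example, not from the thread; the quadratic, the noise level, and the candidate rates are arbitrary choices). It sweeps a few constant learning rates on a noisy 1-D quadratic and reports which one ends with the lowest loss for different step budgets; short budgets tend to favour the large rates, long budgets the small ones.

```python
import random

def avg_final_loss(lr, steps, w0=5.0, noise_std=1.0, n_runs=50):
    """Average final loss of SGD on f(w) = 0.5*w**2 with noisy gradients."""
    total = 0.0
    for run in range(n_runs):
        rng = random.Random(run)                   # same noise seeds for every rate
        w = w0
        for _ in range(steps):
            grad = w + rng.gauss(0.0, noise_std)   # noisy gradient of f
            w -= lr * grad
        total += 0.5 * w * w
    return total / n_runs

rates = (0.5, 0.1, 0.02, 0.005)
for steps in (20, 200, 2000):
    best_lr = min(rates, key=lambda lr: avg_final_loss(lr, steps))
    print(f"budget of {steps:4d} steps -> best constant lr = {best_lr}")
```

No single constant rate is best at every budget, which is exactly the "how much training time remains" dependence.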
I didn't answer about the optimal choice... I answered about the time that remains! Close to the minimum your gradient steps are so "big" that you can't reach the bottom of the cost surface. Think of it like this: you're close to the bottom of a locally convex function, close to the local minimum, and you step in the direction of the minimum. If your step is too big, you end up on the other side of the function, as far from the minimum as you were before the step. Then you take a smaller step, so you don't overshoot the minimum. That's the intuition for annealing the learning rate. It's hard for me to show without a figure, I'm sorry...
(Jan 28 '14 at 14:29)
eder
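Since the missing figure is essentially one picture of gradient descent overshooting on a parabola, a few lines of Python can stand in for it (illustrative only; the quadratic and the step sizes are arbitrary):

```python
# Gradient descent on f(x) = 0.5*x**2, whose gradient is x, so each
# update is x <- (1 - eta) * x.  With eta just under 2 the iterate jumps
# to the other side of the minimum at almost the same distance; with a
# smaller eta the remaining distance shrinks without changing sign.
x = 1.0
for eta in (1.9, 1.9, 1.9, 0.5, 0.5, 0.5):
    x -= eta * x
    print(f"eta={eta:.1f}  x={x:+.4f}")
```

The first three updates just flip the sign of x while barely shrinking it, which is the "other side of the function" behaviour described above; halving the step removes the overshoot.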