Given: You have only a single training example, that is, a single pair of inputs and outputs. For instance, in a 2-3-1 feed-forward neural network you have the example { 0.5, 1 } -> {0.5}. You're using the canonical backprop example (no momentum, jitter, etc.).

My question is: is there any difference between doing a lot of epochs of backprop with a learning rate of, say, 0.1 and doing a single epoch of backprop with a learning rate of 1?

asked Dec 06 '11 at 19:48

Wesley Tansey

3 Answers:

The only way there could be no difference is if the gradient of your error function were still the same after taking a step of size 1 as it was before. In general this happens only when the error is a linear function of the parameters, so any kind of curvature implies a difference, and the bigger the curvature (measured by the second derivative, or Hessian), the bigger the difference. Neural networks are necessarily nonlinear because of the loss function and the hidden layers. There's always a tradeoff in choosing the step size, however. Step sizes that are too small can lose any difference in performance to numerical error and can take orders of magnitude longer to optimize your neural network. Step sizes that are too large will be unstable and can easily throw you from a nice direction in space to a pretty bad one.
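A minimal sketch of the comparison in the question, assuming sigmoid units, a squared-error loss, and made-up initial weights (the question specifies none of these, and the helper names `forward`, `loss`, and `backprop_step` are hypothetical). It runs ten steps at learning rate 0.1 and one step at learning rate 1 from the same starting point on the example { 0.5, 1 } -> {0.5}:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    """Return hidden activations and output of a 2-3-1 sigmoid network."""
    W1, b1, W2, b2 = params
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    y = sigmoid(sum(w * hi for w, hi in zip(W2, h)) + b2)
    return h, y

def loss(params, x, t):
    return 0.5 * (forward(params, x)[1] - t) ** 2

def backprop_step(params, x, t, lr):
    """One plain gradient step on 0.5 * (y - t)^2, with no momentum."""
    W1, b1, W2, b2 = params
    h, y = forward(params, x)
    delta_out = (y - t) * y * (1.0 - y)
    delta_hid = [delta_out * w * hi * (1.0 - hi) for w, hi in zip(W2, h)]
    new_W1 = [[w - lr * d * xi for w, xi in zip(row, x)]
              for row, d in zip(W1, delta_hid)]
    new_b1 = [b - lr * d for b, d in zip(b1, delta_hid)]
    new_W2 = [w - lr * delta_out * hi for w, hi in zip(W2, h)]
    new_b2 = b2 - lr * delta_out
    return new_W1, new_b1, new_W2, new_b2

def train(params, x, t, lr, steps):
    for _ in range(steps):
        params = backprop_step(params, x, t, lr)
    return params

# Arbitrary starting weights, chosen only for illustration.
init = ([[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]], [0.0, 0.0, 0.0],
        [0.3, -0.1, 0.2], 0.0)
x, t = [0.5, 1.0], 0.5

many_small = train(init, x, t, lr=0.1, steps=10)
one_large = train(init, x, t, lr=1.0, steps=1)

print("ten steps at 0.1:", loss(many_small, x, t))
print("one step at 1.0: ", loss(one_large, x, t))
```

The two runs end at different weights because the ten small steps re-evaluate the gradient after every update, while the single large step uses only the gradient at the starting point.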

answered Dec 07 '11 at 05:37

Alexandre Passos ♦

Alexandre said most of this already, but I want to emphasize that the two are different even if you are doing linear regression with squared error, and that if they ever weren't different, training could never converge. Suppose you have a 1d example where you are learning the function f(x; w) = w*x and you have one training case (x1, t). Your gradient is (d/dw) 0.5 * (w*x1 - t)^2 = (w*x1 - t)*x1 = w*x1^2 - t*x1, which depends on w. So once you CHANGE w by taking a step of size alpha along the negative gradient evaluated at the current parameter(s), the parameter(s) are different and the gradient changes too.

Here is another way of showing they must be different. If they weren't different, training could never converge because the updates would never get smaller.
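To make the 1d case concrete, here is a small sketch with a hypothetical training case x1 = 1, t = 1 and starting weight w = 0 (these numbers are illustrative assumptions, as are the function names). Each update multiplies the residual w*x1 - t by (1 - alpha*x1^2), so ten steps at alpha = 0.1 shrink it by 0.9^10, while one step at alpha = 1 removes it entirely in this particular case:

```python
def gradient(w, x, t):
    # d/dw of 0.5 * (w*x - t)^2
    return (w * x - t) * x

def run(w, x, t, alpha, steps):
    """Apply `steps` plain gradient updates and return the final weight."""
    for _ in range(steps):
        w = w - alpha * gradient(w, x, t)
    return w

x1, t = 1.0, 1.0  # hypothetical training case
w0 = 0.0          # hypothetical starting weight

w_small = run(w0, x1, t, alpha=0.1, steps=10)
w_large = run(w0, x1, t, alpha=1.0, steps=1)

print("ten steps at 0.1:", w_small)
print("one step at 1.0: ", w_large)
```

With other values of x1 the single large step would not land exactly on the solution, but the two schedules would still disagree, since the gradient shrinks after every small step.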

answered Dec 08 '11 at 01:39

gdahl ♦

edited Jan 05 at 15:49

There will be. A larger learning rate will "bounce around" in a cycle above the minimum of the energy function, even if it's very close to it. The learning rate must go to zero for convergence to occur, but even if it doesn't, a smaller learning rate will still be better off.

It's also the case that the local gradient may not reflect the global function, so one large step can turn out to improve your objective by quite a bit less than ten smaller steps.
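The "bouncing" can be seen on a toy problem. The sketch below assumes a one-parameter quadratic energy E(w) = 0.5 * c * w^2 with curvature c = 2, which is an illustrative stand-in and not the network from the question; `descend` is a hypothetical helper. With a learning rate of 1 each update flips the sign of w, so the iterate cycles around the minimum without approaching it, while a learning rate of 0.1 shrinks w by a factor of 0.8 per step:

```python
def descend(w, curvature, alpha, steps):
    """Gradient descent on E(w) = 0.5 * curvature * w^2; returns every iterate."""
    path = [w]
    for _ in range(steps):
        w = w - alpha * curvature * w  # the gradient of E is curvature * w
        path.append(w)
    return path

large = descend(1.0, curvature=2.0, alpha=1.0, steps=4)
small = descend(1.0, curvature=2.0, alpha=0.1, steps=4)

print("alpha = 1.0:", large)
print("alpha = 0.1:", small)
```

On this deterministic quadratic the fixed small rate does keep moving toward the minimum; how large a rate a real network tolerates depends on its curvature, which this toy does not model.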

answered Dec 06 '11 at 20:46

Jacob Jensen


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.