|
I am currently implementing an SGD optimizer for a CRF, but the question is about SGD in general: are there any rules of thumb for choosing the step-size? What's a good way of specifying an adaptive step-size? |
|
Take a look at Léon Bottou's SGD implementation of a linear chain CRF. He determines the initial learning rate empirically (see the CrfSgd::calibrate method in crfsgd.cpp) and then uses the same schedule as PEGASOS, eta = 1 / (lambda * t), where lambda is the regularization parameter and t is a floating point number initialized so that, for the first observed example, eta equals the initial learning rate. For each example, t is incremented by 1. To find the initial learning rate, Bottou uses a sample of 1000 examples. He starts at eta=0.1 and divides or multiplies by 2 to try new values. |
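To make the schedule and the calibration idea concrete, here is a minimal Python sketch. It is not Bottou's code: the model is an L2-regularized logistic regression standing in for the CRF, the search in `calibrate` is a simplified doubling/halving loop, and all function names (`make_schedule`, `sgd_pass`, `calibrate`, ...) are made up for illustration.

```python
import math


def make_schedule(eta0, lam):
    """eta(t) = 1 / (lam * (t0 + t)), with t0 chosen so that eta(0) == eta0."""
    t0 = 1.0 / (lam * eta0)
    return lambda t: 1.0 / (lam * (t0 + t))


def log1p_exp(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def inv_1p_exp(z):
    """Numerically stable 1 / (1 + exp(z))."""
    if z >= 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def objective(w, data, lam):
    """Mean logistic loss plus L2 penalty; labels y are in {-1, +1}."""
    loss = sum(log1p_exp(-y * dot(w, x)) for x, y in data) / len(data)
    return loss + 0.5 * lam * dot(w, w)


def sgd_pass(w, data, lam, eta_fn, t_start=0):
    """One pass of SGD over data; eta_fn(t) gives the step size at step t."""
    w = list(w)
    for t, (x, y) in enumerate(data, start=t_start):
        eta = eta_fn(t)
        g = -y * inv_1p_exp(y * dot(w, x))  # d(loss)/d(w . x)
        w = [wi - eta * (lam * wi + g * xi) for wi, xi in zip(w, x)]
    return w


def calibrate(sample, lam, dim, eta_start=0.1, factor=2.0, max_tries=20):
    """Simplified search (an assumption, not crfsgd's exact logic):
    run one constant-rate pass per candidate on the sample and keep
    doubling, then halving, while the regularized objective improves."""
    def score(eta):
        w = sgd_pass([0.0] * dim, sample, lam, lambda t: eta)
        return objective(w, sample, lam)

    best_eta, best_obj = eta_start, score(eta_start)
    for step in (factor, 1.0 / factor):  # larger rates first, then smaller
        eta = eta_start * step
        for _ in range(max_tries):
            obj = score(eta)
            if not obj < best_obj:  # also stops on NaN from a diverged run
                break
            best_eta, best_obj = eta, obj
            eta *= step
    return best_eta
```

In use, you would call `calibrate` once on a sample (about 1000 examples in the description above), pass the result to `make_schedule`, and hand the returned function to `sgd_pass` for the real training passes, carrying `t_start` forward between passes so that t keeps counting examples.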
|
This depends on the problem domain and the type of model you are training, but schemes are often used in which the step-size is decreased linearly or proportionally, or adapted based on the error. One downside of SGD is that choosing the step-size is a bit of an art in itself. The paper by LeCun et al. on efficient backpropagation training discusses this subject and also suggests some schemes that incorporate second order gradient information. For a fixed learning rate you often want to evaluate your results on a small part of the data set first and then lower it by an order of magnitude for the full data set.
The LeCun paper looks very relevant. Thanks!
(Nov 04 '10 at 12:46)
yoavg
|
|
Crfsuite does something recommended in Yann LeCun's paper that really helps: trying many different step sizes prior to starting every batch, and choosing the one with the best error reduction on a small (~100 or so, I think) sample of the training points.
When you say "batch", do you mean an iteration over the training set? Also, what do you mean by error reduction? Is it reduction in the objective function being minimized, or empirical error on a dev set? And how do you choose the step sizes to try?
(Nov 06 '10 at 05:34)
yoavg
By "batch" I mean an iteration over the training set, yes. By "error reduction" I mean reduction in the (regularized) objective. As for the step sizes to try, a range from a couple of orders of magnitude below to a couple of orders of magnitude above the last step size should usually work.
(Nov 06 '10 at 07:27)
Alexandre Passos ♦
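A rough Python sketch of the per-pass step-size selection described in this thread, under some assumptions: this is not Crfsuite's code, the factor-of-10 grid spacing is my own choice, and `run_sgd(w, data, eta)` and `objective(w, data)` are hypothetical callables that you would supply for your own model (the weights `w` are assumed to be a list of floats).

```python
import random


def candidate_step_sizes(last_eta, orders=2):
    """Geometric grid from last_eta / 10**orders up to last_eta * 10**orders."""
    return [last_eta * 10.0 ** k for k in range(-orders, orders + 1)]


def pick_step_size(w, train, last_eta, run_sgd, objective,
                   sample_size=100, rng=random):
    """Try each candidate on a small random sample of the training set and
    return the one that gives the lowest (regularized) objective.

    run_sgd(w, data, eta) -> new weights   (hypothetical, user-supplied)
    objective(w, data)    -> float         (hypothetical, user-supplied)
    """
    sample = rng.sample(train, min(sample_size, len(train)))
    best_eta, best_obj = last_eta, float("inf")
    for eta in candidate_step_sizes(last_eta):
        # Start every trial from a copy of the same weights.
        obj = objective(run_sgd(list(w), sample, eta), sample)
        if obj < best_obj:  # a NaN from a diverged trial never wins
            best_eta, best_obj = eta, obj
    return best_eta
```

You would call `pick_step_size` before each pass over the training set, feed the returned value to that pass, and pass it back in as `last_eta` the next time. The trial runs are only there to compare candidates, so their weights are thrown away.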
|