I am currently implementing an SGD optimizer for a CRF, but the question is for SGD in general: are there any rules of thumb for choosing the step size? What's a good way of specifying an adaptive step size?

asked Nov 04 '10 at 11:58



edited Nov 04 '10 at 11:59

3 Answers:

Take a look at Léon Bottou's SGD implementation of a linear-chain CRF. He determines the initial learning rate empirically (see the CrfSgd::calibrate method in crfsgd.cpp) and then uses the same schedule as PEGASOS: eta = 1 / (lambda * t)

where lambda is the regularization parameter, t is a floating point number such that for the first observed example, eta equals the initial learning rate. For each example, t is incremented by 1.

To find the initial learning rate, Bottou uses a sample of 1000 examples. He starts at eta = 0.1 and divides or multiplies by 2 to try new values.
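The schedule and calibration described above can be sketched as follows. This is a hedged reconstruction, not Bottou's actual C++ code: `try_loss` is a hypothetical callback that runs one SGD pass over a small sample with a given rate and returns the resulting loss, and the doubling/halving search is an assumed simple form of his calibration.

```python
def pegasos_eta(eta0, lam, k):
    """PEGASOS-style rate for the k-th update (k = 0, 1, ...).

    t starts at t0 = 1 / (lam * eta0), chosen so that the rate for
    the first observed example equals the initial rate eta0; t is
    then incremented by 1 per example.
    """
    t0 = 1.0 / (lam * eta0)
    return 1.0 / (lam * (t0 + k))

def calibrate_eta0(try_loss, eta0=0.1, factor=2.0):
    """Pick an initial rate by doubling/halving from eta0 = 0.1.

    try_loss(eta) is a hypothetical callback: it runs SGD over a
    small sample (e.g. 1000 examples) at rate eta and returns the
    resulting loss.
    """
    best_eta, best_loss = eta0, try_loss(eta0)
    eta = eta0 * factor
    loss = try_loss(eta)
    if loss < best_loss:
        # larger rates help: keep multiplying while loss improves
        while loss < best_loss:
            best_eta, best_loss = eta, loss
            eta *= factor
            loss = try_loss(eta)
    else:
        # otherwise keep dividing while loss improves
        eta = eta0 / factor
        loss = try_loss(eta)
        while loss < best_loss:
            best_eta, best_loss = eta, loss
            eta /= factor
            loss = try_loss(eta)
    return best_eta
```

Note that `pegasos_eta(eta0, lam, 0)` returns exactly `eta0`, and the rate then decays as 1/t, which is what the schedule above specifies.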

answered Nov 08 '10 at 08:55


Peter Prettenhofer

This depends on the problem domain and the type of model you are training, but schemes are often used where the step size is decreased linearly or proportionally, or adapted based on the error. One downside of SGD is that this is a bit of an art in itself. The paper by LeCun et al. on efficient backpropagation training discusses this subject and also suggests some schemes that incorporate second-order gradient information. For a fixed learning rate, you often want to evaluate your results on a small part of the data set first and then lower the rate by an order of magnitude for the full data set.
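As a sketch of the decay schemes mentioned (these are assumed common forms, not formulas taken from the LeCun et al. paper):

```python
def proportional_decay(eta0, t, tau=1.0):
    # proportional (1/t-style) decay: eta_t = eta0 / (1 + t / tau)
    return eta0 / (1.0 + t / tau)

def linear_decay(eta0, t, t_max):
    # decrease linearly from eta0 at t = 0 down to 0 at t = t_max
    return eta0 * max(0.0, 1.0 - t / t_max)
```

Here `tau` and `t_max` are tuning knobs: `tau` sets how quickly the proportional schedule starts shrinking, and `t_max` is the step count at which the linear schedule reaches zero.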

answered Nov 04 '10 at 12:36


Philemon Brakel

edited Nov 04 '10 at 13:30

the LeCun paper looks very relevant. Thanks!

(Nov 04 '10 at 12:46) yoavg

Crfsuite does something recommended in Yann LeCun's paper that really helps: before starting every batch, it tries many different step sizes and chooses the one with the best error reduction on a small sample (~100 or so, I think) of the training points.

answered Nov 04 '10 at 17:40


Alexandre Passos ♦

when you say "batch", do you mean an iteration over the training set?

also, what do you mean by error reduction? is it reduction in the objective function being minimized, or empirical error on dev set?

and how do you choose the step sizes to try?

(Nov 06 '10 at 05:34) yoavg

By "batch" I mean iteration over the training set, yes. By "error reduction" I mean reduction on the (regularized) objective. About the step size, usually something a couple of orders of magnitude above and below the last step size should work.
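Putting the two comments together, a minimal sketch of the per-batch probing (assumed form, not Crfsuite's actual code; `try_loss` is a hypothetical callback that returns the regularized objective after a pass over a small ~100-point sample at the given rate):

```python
def pick_step_size(try_loss, last_eta):
    """Probe rates spanning a couple of orders of magnitude
    around the previous rate and keep the one that yields the
    lowest (regularized) objective on the small sample."""
    candidates = [last_eta * 10.0 ** k for k in (-2, -1, 0, 1, 2)]
    return min(candidates, key=try_loss)
```

The chosen rate is then used for the full pass over the training set, and becomes `last_eta` for the next batch's probe.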

(Nov 06 '10 at 07:27) Alexandre Passos ♦


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.