How do I add Gaussian regularization to SGD CRF optimization?

The unregularized gradient at parameter i is (observed_count - expected_count), and the regularized gradient is (observed_count - expected_count - current_value/sigma).

Can I use the same gradient for SGD, or do I have to scale it somehow because I am looking at only b instances for each update?

asked Nov 04 '10 at 12:02

yoavg

2 Answers:

Subtracting a fraction of each parameter is simply the gradient of the regularization term, which is proportional to the squared norm of the parameters. This regularized gradient is what you would use with any gradient-based optimizer, SGD included.

EDIT: I stand corrected here. The objective is a sum over the data points combined with a 1/(2sigma) * w^T w penalty, and in SGD the sum runs over only a subset of b points, so it does make sense to use 1/(2sigma*(datasize/b)) for the penalty instead. Thanks for enlightening me, Alexandre Passos.
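As a rough illustration of the scaling described in the edit, here is a minimal NumPy sketch. It uses a plain multinomial log-linear classifier as a stand-in for a CRF, since its gradient has the same observed-minus-expected form. The function names, the `dataset_size` argument, and the array shapes are assumptions made for this example and do not come from any particular CRF library.

```python
import numpy as np

def softmax(scores):
    # Row-wise softmax, shifted by the row maximum for numerical stability.
    scores = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def batch_gradient(w, X, y, sigma, dataset_size):
    """Gradient of the penalized log-likelihood restricted to one minibatch.

    w: (n_classes, n_features) weights (hypothetical layout)
    X: (b, n_features) minibatch features, y: (b,) integer labels
    sigma: variance-like constant, penalty is w^T w / (2 * sigma)
    dataset_size: total number of training instances N
    """
    b = X.shape[0]
    n_classes = w.shape[0]
    probs = softmax(X @ w.T)            # (b, n_classes) model posteriors
    onehot = np.eye(n_classes)[y]       # (b, n_classes) gold labels
    observed = onehot.T @ X             # observed feature counts
    expected = probs.T @ X              # expected feature counts
    # The minibatch carries only a b/N share of the Gaussian penalty.
    return observed - expected - (b / dataset_size) * w / sigma
```

With this scaling, the minibatch gradients over one full pass add up to the regularized gradient of the whole dataset. Because this is the gradient of a log-likelihood, the update would be an ascent step, for example `w += learning_rate * batch_gradient(w, X_batch, y_batch, sigma, N)`.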

answered Nov 04 '10 at 12:53

Philemon Brakel
edited Nov 06 '10 at 09:35

But the regularization term is subtracted from the (log-)likelihood of the entire dataset, while in SGD the error function covers only b instances.

It seems that the regularization term needs to be scaled somehow (my guess is to divide it by datasize/b).

(Nov 06 '10 at 05:31) yoavg

Yes. What is usually done to justify this, I think, is to minimize the mean negated log-likelihood of the examples plus 1/(2sigma) * ||w||^2. (This also makes parameter estimation more sensible when you don't necessarily know how many examples you have, for example because they are generated stochastically.) In this setting a stochastic gradient makes sense, since the minibatch objective is just the mean negative log-likelihood computed on a sample from the training data.

This will also lead to very different values of sigma: relative to the summed objective, the regularization weight 1/sigma is effectively divided by n, as you mentioned.
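The mean formulation can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: a binary logistic model stands in for the CRF, and `mean_objective_gradient`, `sgd`, and `sigma_mean` are hypothetical names. If the summed objective uses sigma, the equivalent constant for the mean objective works out to `sigma_mean = n * sigma`.

```python
import numpy as np

def mean_objective_gradient(w, X, y, sigma_mean):
    """Gradient of  mean NLL + ||w||^2 / (2 * sigma_mean)  over the rows of X.

    X: (m, d) features, y: (m,) labels in {0, 1}, w: (d,) weights.
    """
    p = 1.0 / (1.0 + np.exp(-(X @ w)))       # predicted P(y = 1 | x)
    nll_grad = X.T @ (p - y) / X.shape[0]     # mean gradient of the NLL
    return nll_grad + w / sigma_mean          # penalty needs no batch scaling

def sgd(w, X, y, sigma_mean, lr=0.1, batch_size=4, epochs=5, seed=0):
    """Plain minibatch SGD on the mean objective (descent, since we minimize)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    for _ in range(epochs):
        order = rng.permutation(n)            # reshuffle once per epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            w = w - lr * mean_objective_gradient(w, X[idx], y[idx], sigma_mean)
    return w
```

Here the same gradient function serves both the full dataset and any minibatch, because the mean over a random sample is an unbiased estimate of the mean over the data. The cost is that sigma has to be retuned, or rescaled by n, compared with the summed objective.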

answered Nov 06 '10 at 07:33

Alexandre Passos ♦
User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.