In NLP and related areas, we often have to deal with difficult optimization problems, which are:

  • large-scale, i.e. involve 10k, 100k or more parameters to optimize, and
  • their gradients and function value are expensive to compute, require expensive dynamic-programming algorithms etc.

For large-scale problems, L-BFGS is a good choice, but it typically requires many passes over the data (often 100 or more), which is prohibitive for expensive-to-compute objective functions.

Online methods are an alternative because they often need orders of magnitude fewer passes over the data to converge.

What are the best online algorithms? Are there good, ready-to-use online-learning libraries, which I could just use and plug in my gradients and function value code?

Typically, if I need an online learning algorithm, I use my own simple stochastic gradient descent (SGD) or RProp implementation, but sometimes that's not good enough. Especially with SGD, it's often too hard to find and tune a good schedule for the learning rate by hand (it requires many different runs and evaluations on dev data, which makes it as expensive as L-BFGS, although parallelizable).

The paper "A stochastic quasi-Newton method for online convex optimization" by Schraudolph et al. (2007) describes an online version of LBFGS and other promising methods, but are there implementations out there? They are not as easy to implement as simple SGD. If they produce solutions about as good as L-BFGS using an order of magnitude fewer iterations, as the paper shows, why does no one in the NLP community seem to use them?

asked Jul 08 '10 at 01:05

Frank's gravatar image


edited Jul 08 '10 at 02:47

Joseph%20Turian's gravatar image

Joseph Turian ♦♦

4 Answers:

Leon Bottou, in his implementation of SGD, writes "The following results have been obtained using plain stochastic gradient. This page does not advocate fancy modifications of an old algorithm, it just shows that the old algorithm itself is amazingly fast!"

I suspect that what he's getting at is that people think that an algorithm as simple as SGD can't be as good as something as tricky to implement as L-BFGS. (When he talks about fancy modifications of an old algorithm, I think he is alluding to Schraudolph's stochastic meta descent, which in 2008 was getting some attention.) It's a prevalent cognitive bias, especially in science, to equate tricky with good.

The more ML-savvy NLP researchers have caught on that SGD is a great optimization algorithm. For example, search for SGD in the ACL anthology. They include Tsuruoka et al (2009) ("Experimental results demonstrate that our method can produce compact and accurate models much more quickly than a state-of-the-art quasi- Newton method for L1-regularized loglinear models.") and Finkel et al (2008) ("We found no difference in parser performance between using stochastic gradient descent (SGD), and the more common, but significantly slower, L-BFGS.").

You are correct that choosing SGD hyperparameters, like the learning rate, can be a pain in the ass. Leon Bottou describes some tricks for choosing the hyperparameters. If you really want good generalization, you have to try a bunch of different learning rates and wait for them to converge. If you want to do some fast and pretty good, you can try different hyperparameters on a small sample (e.g. 1000 examples) and see which decreases the objective the most after just 1000 updates. This gives you a good baseline, and you can explore the hyperparameter space more around this region if you so choose.

SGD is a great general-purpose optimization algorithm, and it is easy to implement. I would generally use it first, before trying something more complicated. I believe SGD is just as good as, if not superior, to L-BFGS in the not highly varying (and sometimes even convex) optimization surfaces common in current NLP models. (I would nonetheless be interested in a controlled comparison between SGD and the L-BFGS using the Berkeley cache-flushing trick.)

With a far more bumpy optimization surface, for example those present in deep models, there are stronger optimizations that might be far superior to SGD. Martens, 2010 ("Deep learning via Hessian-free optimization") recommends Hessian-free optimization (a 2nd-order optimization method).

answered Jul 08 '10 at 02:47

Joseph%20Turian's gravatar image

Joseph Turian ♦♦

Great answer, thanks! I like the tricks on Bottou's home page. We do often deal with very bumpy, non-convex problems though, e.g. in parsing or machine translation, when latent variables are used.

(Jul 08 '10 at 09:16) Frank

This paper ( http://www.cs.utoronto.ca/~tijmen/pcd/pcd.pdf ) uses a mixture of contrastive divergence and gradient descent to train RBMs, which are also highly non-convex with many bumps in the energy function and with latent variables. Maybe some of this translates to your models.

(Jul 08 '10 at 09:20) Alexandre Passos ♦

Two comments on joseph's answer: 1. one of my labmates downloaded leon bottou's code and found his implementation of SGD to be quite fast and copied it into java for us to use. so, completely independent verification that his tricks really do work. 2. despite the quote attributed to me in joseph's answer, i would still always pick L-BFGS over SGD if it was computationally feasible. judging convergence in SGD is not very well understood, and especially on larger datasets i think L-BFGS tends to do better. but i still use SGD when necessary, and agree with the previous comments about the fancy modification usually not being worth it. (i found SMD to be way slower than SGD in practice.)

answered Jul 08 '10 at 04:25

Jenny%20Rose%20Finkel's gravatar image

Jenny Rose Finkel

I agree with Joseph that the Martens (2010) paper looks like a good approach: It can handle noise (which makes it a good candidate for online learning) and it's second-order. If nothing else, the paper is a good source of interesting ideas and approaches that could be composed into a different method for a different problem.

I haven't tried Schraudolph et al. (2007). I will say that I had poor luck with Stochastic Meta-Descent (e.g., Vishwanathan et al., 2006). Depending on the values of the hyper-parameters, it either diverged entirely or performed identically to gradient descent. (My task was MLN weight learning -- a very ill-conditioned problem where online learning was not possible, so YMMV.)

Have you tried just running L-BFGS on a small fraction of the data to get near the optimum, and then fine-tuning it with the full dataset?

answered Jul 08 '10 at 04:21

Daniel%20Lowd's gravatar image

Daniel Lowd

For regularized linear classification, vowpal wabbit has already been cited a few times on this site but there is also liblinear with both primal and dual optimizers and SGD-QN that further adds second order info to the SGD implementation by Leon Bottou.

For training CRFs (sequence labeling), crfsuite implements efficiently a fast online and regularized SGD. I think the Mallet library also has an online SGD based optimizer for training CRFs in java but probably less efficient than crfsuite.

answered Jul 08 '10 at 04:29

ogrisel's gravatar image


Your answer
toggle preview

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.