In NLP and related areas, we often face difficult optimization problems that are:
- large-scale, i.e. they involve 10k, 100k, or more parameters, and
- expensive to evaluate, i.e. computing the function value and gradient requires costly dynamic-programming algorithms or similar.
For large-scale problems, L-BFGS is a good choice, but it typically requires many passes over the data (often 100 or more), which is prohibitive when the objective function is expensive to compute.
Online methods are an alternative because they often need orders of magnitude fewer passes over the data to converge.
What are the best online algorithms? Are there good, ready-to-use online-learning libraries that I could plug my gradient and function-value code into?
Typically, when I need an online learning algorithm, I use my own simple stochastic gradient descent (SGD) or RProp implementation, but sometimes that is not good enough. With SGD in particular, finding and tuning a good learning-rate schedule by hand is often too hard: it requires many separate runs and evaluations on dev data, which makes the whole process as expensive as L-BFGS, although the runs are parallelizable.
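To make the tuning problem concrete, here is a minimal sketch of SGD with a decaying `eta0 * t0 / (t0 + t)` learning-rate schedule on a toy least-squares objective. The constants `eta0` and `t0` are hypothetical knobs, not recommended values; they are exactly what has to be searched over on dev data, which is the pain point described above.

```python
import random

def sgd(grad, x0, data, eta0=0.1, t0=100.0, epochs=50, seed=0):
    """Minimize sum_i f(x, d_i) given a per-example gradient oracle `grad`."""
    rng = random.Random(seed)
    x = list(x0)
    t = 0
    for _ in range(epochs):
        examples = data[:]
        rng.shuffle(examples)        # fresh random pass each epoch
        for d in examples:
            eta = eta0 * t0 / (t0 + t)   # decaying step size: the schedule to tune
            g = grad(x, d)
            x = [xi - eta * gi for xi, gi in zip(x, g)]
            t += 1
    return x

# Toy problem: minimize sum_i (x - d_i)^2; the optimum is the data mean.
data = [1.0, 2.0, 3.0, 4.0]
grad = lambda x, d: [2.0 * (x[0] - d)]
x = sgd(grad, [0.0], data)
print(x[0])  # a value close to the data mean 2.5
```

Change `eta0` or `t0` by an order of magnitude and convergence can slow down dramatically or stall, which is why each schedule candidate needs its own run and dev-set evaluation.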
The paper "A stochastic quasi-Newton method for online convex optimization" by Schraudolph et al. (2007) describes an online version of L-BFGS, among other promising methods, but are there implementations out there? These methods are not as easy to implement as plain SGD. If they produce solutions about as good as L-BFGS with an order of magnitude fewer iterations, as the paper reports, why does no one in the NLP community seem to use them?
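To illustrate why these methods are harder to implement than SGD, here is a simplified sketch in the spirit of online L-BFGS: curvature pairs are built from gradient differences measured on the *same* minibatch so they stay consistent despite stochastic sampling, and a damping term keeps them usable. This is my own toy rendition, not the paper's algorithm; all constants (`eta`, `lam`, `m`) are hypothetical.

```python
from collections import deque

def two_loop(g, pairs):
    """Standard L-BFGS two-loop recursion: approximate H^{-1} g from (s, y) pairs."""
    q = g[:]
    alphas = []
    for s, y in reversed(pairs):                 # newest to oldest
        rho = 1.0 / sum(si * yi for si, yi in zip(s, y))
        a = rho * sum(si * qi for si, qi in zip(s, q))
        alphas.append((a, rho, s, y))
        q = [qi - a * yi for qi, yi in zip(q, y)]
    if pairs:                                    # scale by most recent curvature
        s, y = pairs[-1]
        gamma = sum(si * yi for si, yi in zip(s, y)) / sum(yi * yi for yi in y)
        q = [gamma * qi for qi in q]
    for a, rho, s, y in reversed(alphas):        # oldest to newest
        b = rho * sum(yi * qi for yi, qi in zip(y, q))
        q = [qi + (a - b) * si for qi, si in zip(q, s)]
    return q

def olbfgs(grad, x0, batches, eta=0.5, lam=1.0, m=10):
    x = list(x0)
    pairs = deque(maxlen=m)              # limited-memory curvature history
    for batch in batches:
        g = grad(x, batch)
        d = two_loop(g, list(pairs))     # quasi-Newton search direction
        x_new = [xi - eta * di for xi, di in zip(x, d)]
        g_new = grad(x_new, batch)       # same minibatch: consistent difference
        s = [a - b for a, b in zip(x_new, x)]
        # damping term lam*s keeps s.y > 0 even with noisy gradients
        y = [gn - go + lam * si for gn, go, si in zip(g_new, g, s)]
        pairs.append((s, y))
        x = x_new
    return x

# Toy usage: least squares over minibatches of 2; the optimum is the data mean.
data = [1.0, 2.0, 3.0, 4.0] * 5
batches = [data[i:i + 2] for i in range(0, len(data), 2)]
grad = lambda x, b: [sum(2.0 * (x[0] - d) for d in b) / len(b)]
x = olbfgs(grad, [0.0], batches)
```

Even this toy version needs two gradient evaluations per minibatch and careful handling of the curvature pairs, which may partly explain why plain SGD remains the default despite the reported iteration savings.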