Confidence weighted online learning (introduced in Mark Dredze et al, Confidence Weighted Linear Classification) assigns a gaussian prior to the weights of a classifier effectively adding not just regularization but a confidence weight to the gradients: the learner is less confident about less frequent features and is willing to change them more per iteration. So the update is no longer x(t+1) = x(t) - lrate * v_i * y_i but x(t+1) = x(t) - lrate* y_i * Sigma v_i, where Sigma is an approximation to the covariance matrix of the features (and is updated at every round by adding/subtracting some terms to the inverse covariance matrix).

This seems close to quasi-newton methods, where you update by doing a line search in the direction determined by the inverse hessian (so x(t+1) = x(t) - p * y_i * H^(-1) v_i , where p is determined by a line search) and the hessian is updated sort-of-additively at each round.

It seems to me as if there's a connection between quasi-newton methods and confidence-weighted learning, where the inverse covariance matrix in the confidence-weighted learning algorithm acts roughly as a stant-in to the inverse hessian (and the inverse covariance matrix acts as a stand-in to the hessian). Of course, confidence-weighted is stochastic and quasi-newton methods are deterministic (so I maybe abused notation in describing them as I did).

Does this make sense? Does this help explain the very good performance of confidence-weighted learning algorithms? Can this be used to generalize confidence-weighted learning as a stochastic quasi-newton method? Or would just any algorithm that uses second-order information look like this?

asked Sep 10 '10 at 17:41

Alexandre%20Passos's gravatar image

Alexandre Passos ♦
1893744214333


2 Answers:

I think the analogy makes sense and if you squint a little, it has to do with the fact that the inverse of the Hessian is asymptotically the covariance matrix of parameters you are trying to learn. Also note that similar techniques used to speed up quasi-newton methods can also be applied to the CW framework as well. There is some follow-up work of the CW paper (from this year's AISTATS: Exploiting Feature Covariance in High-Dimensional Online Learning) for efficiently updating the inverse covariance matrix used by the CW algorithm. One of the proposed approaches they describe (the buffering approach) looks very much like limited memory BFGS.

answered Sep 10 '10 at 19:54

spinxl39's gravatar image

spinxl39
3458104368

It's nice to know that I'm not being naive/insane with this. Have you got a reference for the relationship between the covariance matrix and the hessian? Does it apply only in the linear case or also in the nonlinear models?

(Sep 10 '10 at 20:23) Alexandre Passos ♦
1

I'd correct myself -- covariance is negative of the Hessian. I may be wrong but here is my understanding of the equivalence: when you do maximum likelihood estimation, the parameter's covariance is the Fisher information which in turn is the negative of the inverse of the expected Hessian of the likelihood function. In the quasi-newton method, you have a loss function (instead of a likelihood function) that you are minimizing, so there would be a sign change so I said that covariance is just the Hessian.

(Sep 10 '10 at 20:37) spinxl39

I see. Thanks :-)

(Sep 10 '10 at 20:38) Alexandre Passos ♦

Confidence-weighted learning minimizes something very similar to the natural gradient (as I explain in this blog post), and there are techniques that explore minimizing the natural gradient using newton's method, as explained in Le Roux et al A fast natural newton method, in ICML 2010.

answered Oct 13 '10 at 19:11

Alexandre%20Passos's gravatar image

Alexandre Passos ♦
1893744214333

Your answer
toggle preview

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.