Where can I find a publicly available description of the Hildreth QP optimization algorithm which is used in multiclass and structured MIRA, and in particular in Ryan McDonald's MST parser?

The relevant papers refer to a book by Censor and Zenios ("Parallel Optimization"), but this is a bit hard to find.

asked Aug 22 '11 at 12:32

yoavg


One Answer:

(not a complete answer, just a comment that outgrew the comment box)

The longest description I can find is in the javadoc for JavaNLP from Stanford (JavaNLP seems to be a Stanford-only superset of their CoreNLP library that includes all sorts of closed code). I can't seem to find the corresponding code, but the doc is as follows:

 Hildreth quadratic programming solver

 This code can solve any optimization problem of the form:

 max_x x'Cx + d'x

 subject to Gx <= h

 Procedure:

 Let: A  = (-1/4) G C^-1 G'
      b' = (1/2) d'C^-1 G' + h

 Solve min_z z'Az + b'z

 subject to z >= 0

 Solved with really simple Gauss-Seidel like solver with
 clipping of components of z at 0 after each update

 After solving the above, recover x with

 x = 1/2 C^-1(G'z - d)

 The procedure is described in Clifford Hildreth's
 "A quadratic programming procedure", Naval
 Research Logistics Quarterly 4(1), 1957.

 There is additional discussion of, and a generalization to arbitrary
 convex functions with continuous derivatives in, D. A. D'Esopo's paper
 "A convex programming procedure", Naval Research Logistics Quarterly 6(1), 1959.

 However, the current implementation comes directly from Hildreth.

 Notes: This is an absolutely ancient algorithm that is still popular
 within the machine learning community for solving
 quadratic programming (QP) problems. Its continued use is
 no doubt due to the extreme simplicity of the algorithm.

 Variants of the procedure are used by SVMLight's default QP
 solver and the MIRA implementation packaged with the MSTParser.

This, coupled with the openly available code, could be enough to understand things. Unfortunately, all the papers I saw on the method seem to be behind horrible Wiley paywalls.

I think that, given these descriptions, it should be possible to hack together a solver for this.
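
For illustration, here is a minimal NumPy sketch of the Gauss-Seidel-with-clipping procedure described in the quoted javadoc; the function name hildreth_qp and the variable names are just labels for the quantities above, not anything taken from the Stanford or MSTParser code:

    import numpy as np

    def hildreth_qp(C, d, G, h, n_iters=1000, tol=1e-8):
        """Sketch of the Hildreth-style solver described above.

        Maximizes x'Cx + d'x subject to Gx <= h (C assumed symmetric and
        negative definite), via Gauss-Seidel on the dual with clipping at zero.
        Illustrative only, not an optimized implementation.
        """
        C_inv = np.linalg.inv(C)
        A = -0.25 * G @ C_inv @ G.T        # A  = (-1/4) G C^-1 G'
        b = 0.5 * (G @ C_inv @ d) + h      # b' = (1/2) d'C^-1 G' + h
        z = np.zeros(G.shape[0])
        for _ in range(n_iters):
            z_old = z.copy()
            for i in range(len(z)):
                if A[i, i] <= 1e-12:
                    continue
                # exact coordinate minimizer of z'Az + b'z in z_i, clipped at 0
                grad_i = 2.0 * A[i] @ z + b[i]
                z[i] = max(0.0, z[i] - grad_i / (2.0 * A[i, i]))
            if np.max(np.abs(z - z_old)) < tol:
                break
        # recover the primal solution: x = 1/2 C^-1 (G'z - d)
        return 0.5 * C_inv @ (G.T @ z - d)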

answered Aug 22 '11 at 13:15

Alexandre Passos ♦

Also, in Aria Haghighi's blog ( http://aria42.com/blog/?p=216 ) there is an explanation of how to derive an analytic expression for the solution to the MIRA problem as long as you can do loss-augmented inference. It's the same derivation as in the passive-aggressive paper ( http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.61.5120&rep=rep1&type=pdf ), except here the problem is slightly different.

I don't really understand why MSTparser doesn't just use the analytic PA update. Can you explain that?

(Aug 22 '11 at 13:41) Alexandre Passos ♦

Thanks. I already have the code (and a quick-and-dirty Cython port based on it), but I want to actually understand the method.

Aria describes the 1-best MIRA variant, which has a closed-form solution and which is equivalent to the PA algorithm (though the PA paper updates against sqrt(loss), which is needed for the analysis).

In practice MSTParser could work just as well with the PA update, but in the original papers and problem formulation, you want to separate the correct tree from the top k incorrect trees, not just from the top-1 incorrect tree, and there the PA update is not sufficient (though it turns out empirically that for this kind of dependency parsing, updating against the 1-best tree really is good enough).

(another reason is that historically Koby Crammer came up with MIRA before PA)

(Aug 22 '11 at 13:52) yoavg
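
For concreteness, a minimal sketch of the closed-form 1-best (PA-style) update discussed in the comments above; feats(x, y) and loss(y_gold, y_pred) are hypothetical stand-ins for a feature function and a structured loss, not MSTParser's API:

    import numpy as np

    def pa_1best_update(w, x, y_gold, y_pred, feats, loss):
        """One passive-aggressive / 1-best MIRA step (illustrative sketch).

        feats(x, y) returns a feature vector; loss(y_gold, y_pred) is the
        structured loss of the predicted structure against the gold one.
        """
        delta = feats(x, y_gold) - feats(x, y_pred)
        hinge = loss(y_gold, y_pred) - w @ delta      # violation of the margin constraint
        norm_sq = delta @ delta
        if hinge > 0 and norm_sq > 0:
            # closed-form solution of the single-constraint QP
            tau = hinge / norm_sq
            w = w + tau * delta
        return w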

I've also been looking for the paper describing Hildreth's algorithm but could never find it, so the above explanation is more than welcome. Thanks!

To answer your question, the analytical solution exists only when you have a single constraint. If you look at the equations in Aria's blog, you'll see that the only constraint involves y* (the predicted label). In structured prediction, it is common to use k constraints for the k-best results returned by the decoder (usually implemented with dynamic programming, such as the Viterbi algorithm). For k > 1, you need a numerical QP solver. If I remember correctly, Ryan McDonald reports better results with k-best than with 1-best, although the difference was not that big.

By the way, something I noticed a while back: the constraints involve the predictions made BEFORE the weight vector is updated. This is a hack, because in theory the constraints should involve predictions made with the weights AFTER the update, but I don't see how this could be done in a tractable way. The same hack is used for multiclass MIRA.

Hildreth's algorithm is also implemented in http://www.seas.upenn.edu/~strctlrn/StructLearn/StructLearn.html if I remember correctly.

(Aug 22 '11 at 14:13) Mathieu Blondel
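
To make the k-best constraint setup concrete, here is a hedged sketch of how one MIRA step could be phrased in the form accepted by the hildreth_qp sketch given with the answer above; kbest_decode, feats, and loss are hypothetical stand-ins for the decoder, feature function, and structured loss, not MSTParser's actual API:

    import numpy as np

    def kbest_mira_step(w, x, y_gold, feats, kbest_decode, loss, k=5):
        """One k-best MIRA step, written as  max_x x'Cx + d'x  s.t.  Gx <= h.

        The underlying problem is
            min ||w_new - w||^2
            s.t. w_new . (feats(x, y_gold) - feats(x, y_i)) >= loss(y_gold, y_i)
        for each of the k-best trees y_i (decoded with the OLD weights, as
        noted above).  Forming C = -I explicitly is wasteful for large
        feature spaces; this is illustrative only.
        """
        candidates = kbest_decode(w, x, k)
        deltas = np.array([feats(x, y_gold) - feats(x, y) for y in candidates])
        losses = np.array([loss(y_gold, y) for y in candidates])
        C = -np.eye(w.shape[0])
        d = 2.0 * w
        G = -deltas          # w_new . delta_i >= l_i  rewritten as  -delta_i . w_new <= -l_i
        h = -losses
        return hildreth_qp(C, d, G, h)   # the sketch from the answer above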

Looks like yoavg was faster than me to answer! yoavg, I'm potentially interested in your Cython code if you can share it in a gist. Thanks!

(Aug 22 '11 at 14:18) Mathieu Blondel

If the only addition is k constraints instead of one, I'd just throw projected gradient ascent on the dual at this (or even dual coordinate ascent) and see what happens.

If the primal problem is (with f(x,y) being the feature vector of x with labels y; and y0 being the true label here)

min. ||w - w0||² s.t. w·(f(x,y0) - f(x,yi)) >= l(x,yi) for all i

then the dual problem is

max. w0·(sum_i lambda_i (f(x,yi) - f(x,y0))) - (1/4) sum_ij lambda_i lambda_j (f(x,yi)-f(x,y0))·(f(x,yj)-f(x,y0)) + sum_i lambda_i l(x,yi)

s.t. lambda_i >= 0

and the solution is w = w0 - (1/2) sum_i lambda_i (f(x,yi) - f(x,y0)). I'd probably try just projected gradient ascent on lambda_i or coordinate ascent on lambda_i; both should be easy enough to implement and should converge quickly; at worst you can use l-bfgs-b (which supports positivity constraints) on the dual, which will definitely be fast enough.

(Aug 22 '11 at 14:51) Alexandre Passos ♦
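
A minimal sketch of the clipped projected-gradient idea just described, operating on the dual above; diffs and losses are hypothetical inputs holding f(x,yi) - f(x,y0) and l(x,yi) for each constraint, and the fixed step size is an arbitrary choice:

    import numpy as np

    def mira_dual_projected_gradient(w0, diffs, losses, step=0.01, n_iters=200):
        """Projected gradient ascent on the MIRA dual sketched above.

        diffs[i]  = f(x, y_i) - f(x, y_0)   (one row per constraint)
        losses[i] = l(x, y_i)
        Maximize the dual over lambda_i >= 0, then recover
        w = w0 - (1/2) sum_i lambda_i diffs[i].
        """
        K = diffs @ diffs.T            # Gram matrix of the constraint directions
        c = diffs @ w0 + losses        # linear part of the dual gradient
        lam = np.zeros(len(losses))
        for _ in range(n_iters):
            grad = c - 0.5 * K @ lam                  # gradient of the dual objective
            lam = np.maximum(0.0, lam + step * grad)  # gradient step, then clip at zero
        return w0 - 0.5 * diffs.T @ lam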

@Mathieu Blondel: I have put my Cython code online: hildreth.pyx. It is a straightforward adaptation of the Java code, not optimized at all, but it seems to work fine. Hope you find it useful.

(Aug 23 '11 at 09:37) yoavg

@Alexandre Passos: what's projected gradient? Do you have a good pointer?

(Aug 23 '11 at 09:43) yoavg

@yoavg: projected gradient is just "do a gradient step and then project back onto the constraints". Another name is forward-backward splitting, I think (see http://www.cs.berkeley.edu/~jduchi/projects/DuchiSi09c.pdf for an example). Usually projecting is a bit of work, but in this dual all you have to do is set the negative lambdas to zero (or just not take a negative gradient step if lambda is already 0).

(Aug 23 '11 at 09:44) Alexandre Passos ♦

@Alexandre Passos: really? Just do GD and clip at zero? I am constantly amazed at how many of the ad-hoc heuristics I tried once actually have fancy names! :)

(any suggestions regarding choosing the step size?)

(Aug 23 '11 at 14:15) yoavg