I have a convex joint likelihood function L(w) and a non-convex marginal likelihood function J(w), where 'w' is the vector of parameters of a CRF. The difference between these is that in J(w) some variables are latent (i.e. integrated out), while all variables are observed in L(w). The number of instances in L(w) is small, while the number of instances in J(w) is large.

Now I want to interpolate these functions and add a regularizer, in order to optimize predictions on unseen data. That is, I want to minimize I(w) = aL(w) + (1-a)J(w) + b*||w||^2, where the hyper-parameters a and b are tuned on a held-out set. The approach I've taken so far is to optimize the interpolated function with stochastic gradient descent, with the parameters initialized to the zero vector. This seems to be somewhat unstable, however, so I'm looking into initializing the optimization of I(w) at the minimizer of the convex function L(w).
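To make the warm-start idea concrete, here is a minimal sketch. It is not the CRF setup from the question: L and J are hypothetical one-line stand-ins (a convex quadratic and a non-convex cosine term), the names `descend`, `grad_I`, `w_warm` and `w_cold` are made up for illustration, and plain full-gradient descent is used in place of SGD so the run is deterministic.

```python
import math

# Toy stand-ins (assumptions, not the CRF objectives from the question).
def L(w):                      # convex: 0.5 * sum((w_i - 1)^2)
    return 0.5 * sum((wi - 1.0) ** 2 for wi in w)

def grad_L(w):
    return [wi - 1.0 for wi in w]

def J(w):                      # non-convex: sum(cos(3 w_i) + 0.1 w_i^2)
    return sum(math.cos(3.0 * wi) + 0.1 * wi * wi for wi in w)

def grad_J(w):
    return [-3.0 * math.sin(3.0 * wi) + 0.2 * wi for wi in w]

def I(w, a, b):
    """Interpolated, L2-regularized objective a*L + (1-a)*J + b*||w||^2."""
    return a * L(w) + (1.0 - a) * J(w) + b * sum(wi * wi for wi in w)

def grad_I(w, a, b):
    return [a * gl + (1.0 - a) * gj + 2.0 * b * wi
            for gl, gj, wi in zip(grad_L(w), grad_J(w), w)]

def descend(grad, w0, lr=0.05, steps=500):
    """Plain gradient descent with a fixed step size (hypothetical helper)."""
    w = list(w0)
    for _ in range(steps):
        w = [wi - lr * gi for wi, gi in zip(w, grad(w))]
    return w

a, b = 0.5, 0.01               # illustrative values, not tuned
zeros = [0.0, 0.0, 0.0]

# Step 1: minimize the convex part alone, starting from zero.
w_convex = descend(grad_L, zeros)

# Step 2: use that minimizer to seed the interpolated objective.
w_warm = descend(lambda w: grad_I(w, a, b), w_convex)

# Baseline for comparison: same objective, zero initialization.
w_cold = descend(lambda w: grad_I(w, a, b), zeros)

print("warm start:", I(w_warm, a, b))
print("cold start:", I(w_cold, a, b))
```

Which start ends in the better local optimum depends on the functions involved, so the comparison printed here says nothing about the real CRF objectives.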

Does this seem like a reasonable approach, and if so, does anyone have practical advice on how to optimize such combinations of convex and non-convex functions?

asked Nov 24 '10 at 09:36

Oscar Täckström

One Answer:

There is a recent study on exactly this problem. See "Exponential family hybrid semi-supervised learning" by Agarwal and Daumé III.

answered Nov 24 '10 at 10:22

Alexandre Passos ♦

Thanks for the pointer. That paper describes a slightly different problem of combining generative and discriminative models. The problem I'm looking at is how to interpolate two discriminative models, where one contains latent variables that correspond to observed variables in the other. It seems to me that it should be possible to use a simpler interpolation in my case? The parameters are in the same space already, although the optima of the marginal and joint likelihoods might not be aligned.

(Nov 24 '10 at 11:16) Oscar Täckström

Then I suggest you either use two sets of weights (one for each model) and solve a single optimization problem regularizing to minimize their differences (like minimizing Likelihood1(w1) + Likelihood2(w2) + ||w1-w2||) or just use a single w as you suggested. Maybe something like dual decomposition http://www.aclweb.org/anthology/D/D10/D10-1125.pdf will work better.
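A rough sketch of the two-weight-vector variant, under loud assumptions: the two likelihood terms are replaced by hypothetical toy functions (a convex quadratic and a non-convex cosine term), the coupling is the squared norm c*||w1-w2||^2 rather than the plain norm written above so that it is differentiable everywhere, and the names `coupled`, `grad_coupled` and `c` are made up for illustration.

```python
import math

# Toy stand-ins for the two likelihood terms (assumptions only).
def loss1(w):                  # convex: 0.5 * sum((w_i - 1)^2)
    return 0.5 * sum((wi - 1.0) ** 2 for wi in w)

def grad_loss1(w):
    return [wi - 1.0 for wi in w]

def loss2(w):                  # non-convex: sum(cos(3 w_i) + 0.1 w_i^2)
    return sum(math.cos(3.0 * wi) + 0.1 * wi * wi for wi in w)

def grad_loss2(w):
    return [-3.0 * math.sin(3.0 * wi) + 0.2 * wi for wi in w]

def coupled(w1, w2, c):
    """loss1(w1) + loss2(w2) + c * ||w1 - w2||^2 (squared norm is an assumption)."""
    gap = sum((x - y) ** 2 for x, y in zip(w1, w2))
    return loss1(w1) + loss2(w2) + c * gap

def grad_coupled(w1, w2, c):
    # The coupling term pulls w1 towards w2 and w2 towards w1.
    g1 = [g + 2.0 * c * (x - y) for g, x, y in zip(grad_loss1(w1), w1, w2)]
    g2 = [g - 2.0 * c * (x - y) for g, x, y in zip(grad_loss2(w2), w1, w2)]
    return g1, g2

def descend_pair(w1, w2, c, lr=0.05, steps=500):
    """Joint gradient descent on both weight vectors (hypothetical helper)."""
    w1, w2 = list(w1), list(w2)
    for _ in range(steps):
        g1, g2 = grad_coupled(w1, w2, c)
        w1 = [x - lr * g for x, g in zip(w1, g1)]
        w2 = [y - lr * g for y, g in zip(w2, g2)]
    return w1, w2

c = 1.0                        # coupling strength, illustrative value
zeros = [0.0, 0.0, 0.0]
w1, w2 = descend_pair(zeros, zeros, c)
print("coupled objective:", coupled(w1, w2, c))
print("gap between w1 and w2:", [x - y for x, y in zip(w1, w2)])
```

In this formulation c would be a hyper-parameter to tune; as c grows, the two vectors are pushed together and the problem approaches the single-w version.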

Besides this I have nothing to say, so we should wait for someone more knowledgeable.

(Nov 24 '10 at 11:23) Alexandre Passos ♦

That is another idea I've considered. Thanks for the link to the Koo et al. paper; I didn't know of that work or of dual decomposition methods. It seems interesting.

(Nov 24 '10 at 13:51) Oscar Täckström

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.