1
2

What is the form of the gradient in an arbitrary CRF that may be chain-, tree-shaped or loopy? I'm talking about the gradient of the (negative) log-likelihood w.r.t. the feature vector.

I am trying to implement such a CRF and train the feature weights using simple gradient descent (or, alternatively, RProp). The implementation is relatively straightforward, but I'm getting strange results when the graph is tree-shaped (haven't tried loopy yet). The gradient norm keeps decreasing in the first few iterations of GD or RProp, which is good, but then it keeps getting bigger. So I'm wondering if I have a wrong understanding of how to compute the gradients.

I was assuming that in the tree-shaped case, the gradients are basically computed in the same way as the gradients in a traditional chain-structured CRF. I.e., just compute the feature expectations under p(Y*|X) (clamped case) minus the feature expectations under p(Y|X) (unclamped case), where Y are the variables that are unobserved at test time, and Y* are their true values. Is that right? (To compute the feature expectations I just use (loopy) BP, so they would be inexact in the loopy case.)

What about double-counted variables? Do I have to do anything special for variables that are connected to more than one factor? I'm thinking of the Bethe Free Energy, where you substract a term for each variable that is attached to more than one factor in the factor graph, see Heskes 2002, for example. Do I need to worry about the Bethe Free Energy at all, when training my arbitrary CRFs?

Another, related question: In the tree-shaped case I can compute the exact log-likelihood (computing the numerator and the denominator similar to a chain-structured CRF but using BP), right? So I should be able to use L-BFGS?

asked Aug 29 '10 at 07:35

Carl1234's gravatar image

Carl1234
16113

btw, to compute exact log likelihood in trees you could run BP to convergence, then use formula like on p.2 of this to get Z

(Aug 29 '10 at 16:19) Yaroslav Bulatov

@yaroslav: Thanks for the link. I guess you meant loopy graphs, not trees.

(Aug 29 '10 at 16:47) Carl1234

Loopy BP with parallel updates will converge after d iterations on any tree with diameter d, converged messages give you exact marginals and exact Z

(Aug 29 '10 at 17:29) Yaroslav Bulatov

2 Answers:

I don't think you need to worry about the Bethe free energy. Your formulation seems right, but you shouldn't need loopy BP to compute the gradients. The gradient is derived from the conditional log-likelihood L(Y|x) = Log(exp(w·f(x,Y))/sum_y(exp(w·f(x,y))) = w·f(x,Y) - Log(sum_y(exp(w·f(x,y))), so d/dw_i = f(x,Y)_i - sum_y(f(x,Y)_i exp(w·f(x,y)))/sum_y(exp(w·f(x,y)), where f(x,y) is the feature vector of the labeling x,y and f(x,y)_i is the i-th component of that vector. If you can compute the exact log-likelihood you can compute this gradient in the same algorithm, with no extra passes, I think.

I suggest you try optimizing with a fixed, preexisting, well-tested optimizer before using SGD. I like L-BFGS, and it has an implementation (l-bfgs-b, by Nocedal) that has been ported into almost every programming language in use today, it's just a matter of downloading and using it. To use it you just plug in the function and the gradient, and let it optimize. If it finds the minimum you can then move on to fixing your SGD code (or tuning the learning rates).

answered Aug 29 '10 at 07:52

Alexandre%20Passos's gravatar image

Alexandre Passos ♦
1893744214333

Great, thanks! I mentioned loopy L-BFGS just for loopy topologies; on a chain or tree I just run one forward and one backward pass, of course. After a bug fix it's all working now, incl. L-BFGS on chains and trees.

(Aug 29 '10 at 15:16) Carl1234

It's probably better to use a variational approximation than loopy bp for precisely these issues.

(Aug 29 '10 at 15:30) Alexandre Passos ♦

Another approach is pseudo-likelihood, you can compute gradient exactly and resulting estimator is consistent (unlike variational methods) http://arxiv.org/abs/1003.0691

(Aug 29 '10 at 19:52) Yaroslav Bulatov

As Alexandre says, "expected feature counts minus observed feature counts" is the gradient form for any CRF, and it follows by simple differentiation. Notably, second derivative of log-likelihood gives the variance matrix (ie, p.62 of Wainwright's "Graphical Models,...").

Are your expected feature counts getting closer to observed feature counts? For debugging, you can try to fix all parameters except 1 at their true values and just try to do gradient descent to see if it learns the remaining parameter correctly (ie, generate dataset where per-class feature distribution for all features except 1 is uniform, and set weights to 0 for those features)

You can use L-BFGS any time you can compute the gradient. BTW, apparently exponentiated gradient does much faster than L-BFGS for CRF's

When training arbitrary CRF's I'd be more worried about having graphs with high connectivity graphs/informative transition matrices. In such situation, Belief Propagation fails to converge, resulting gradient estimates after damping are poor, gradient descent can fail to converge, and resulting parameter estimates after damping are poor as well.

answered Aug 29 '10 at 13:45

Yaroslav%20Bulatov's gravatar image

Yaroslav Bulatov
1963193458

Thanks, I agree, generating data from the model and then decoding it is a great debugging method. I found that the strange behavior was due to a bug and rounding errors, and the optimization is working now. You write that you can always use L-BFGS if you can compute the gradient. But I think I won't be able to use L-BFGS in the loopy case, since the gradients and the objective value are then approximate and may not exactly match, so I expect L-BFGS to fail in that case. But SGD and RProp may still work, although, like you say, in a highly connected graph it might give bad results.

(Aug 29 '10 at 15:12) Carl1234

L-BFGS is really just an improved version of gradient descent. Why do you think SGD will work better with approximate gradients?

(Aug 29 '10 at 16:04) Yaroslav Bulatov

Because L-BFGS does not behave very well when the gradient changes a lot---it builds an approximation to the hessian that can be wildly unstable if the gradient varies a lot. Line minimizers also behave really badly in this setting, since for two function evaluations the values can be completely different, so it's hard to find a minimum (and most L-BFGS implementations rely on a line minimizer).

(Aug 29 '10 at 16:10) Alexandre Passos ♦

If estimate of the Hessian is wildly unstable, then the gradient must be even more wildly unstable (since the estimated hessian is a kind of average of last few gradients), so it seems like any gradient based method will have a problem there

(Aug 29 '10 at 16:14) Yaroslav Bulatov

This doesn't necessarily follow: suppose we're in the middle of training, and some examples are reasonably well classified while others are wrong. If you're doing SGD, some of the time you will make a big update (for the really wrong things) and some of the time you will make a small update (for the almost correct things). SGD just doesn't remember what happened last iteration.

Now assume you're doing just SGD+line minimization (roughly l-bfgs when m=1). The first round gets a function value and a gradient, and steps in the direction of the gradient with some default step size it plans to narrow later. Then it evaluates the function, to see if the reduction is big enough, or if it needs to narrow/increase the step size. However, imagine you picked a very wrong example in the first step, and now you pick a very well-classified example: even if the step didn't change much, L-BFGS thinks it has succeeded and moves on to another iteration. On the other hand, suppose the first iteration picks a mostly correct example and the second picks a really incorrect one: now suddenly the line minimizer thinks that step was really crappy (remember log loss can vary in orders of magnitude between different examples in the beggining of training) and gets another example, which gets a completely different log loss, etc. How can you do reliable line minimization in this setting?

Stochastic second-order optimization methods generally avoid using line searches for this very reason. See james martens' paper on hessian-free methods or http://eprints.pascal-network.org/archive/00003992/01/SchYuGue07.pdf on a stochastic version of l-bfgs for similar issues.

(Aug 29 '10 at 16:23) Alexandre Passos ♦

OK, it seems like we are talking about different things, I meant regular (batch) l-BFGS. It seems it should always be better than batch gradient descent. The relevant question is whether it's always better than stochastic gradient descent.

(Aug 29 '10 at 19:27) Yaroslav Bulatov

I'm talking about the original algorithm, by Liu and Nocedal, described IIRC in this paper http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.110.6443&rep=rep1&type=pdf

(Aug 29 '10 at 19:30) Alexandre Passos ♦

If training set log-likelihood is your objective function, then l-BFGS computes gradient over whole training set at each step, and I don't see how it's the same as SGD

(Aug 29 '10 at 19:39) Yaroslav Bulatov
showing 5 of 8 show all
Your answer
toggle preview

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.