
I've recently become interested in unsupervised training of log-linear models for NLP (and maybe other tasks), so I've decided to create this question to collect good papers and/or code pointers in this area.

To get the ball rolling, I'm aware of:

  • Hinton's contrastive divergence, and its follow-up paper with a more detailed analysis (although this doesn't necessarily seem applicable in all cases). There's also the version that uses three-way interactions. In one line, it seems to "push probability mass" towards the observed data by shifting it from samples from the model to the observed data. You can make it efficient either by approximating the distribution of samples from the model with a short run of Gibbs sampling, or by using a variational lower bound on the CD gradient and doing SGD.

  • Smith and Eisner's contrastive estimation; it also works for grammar induction. It works by determining a priori a neighborhood B_i for each training datum and maximizing prod_i [ P(datum_i) / sum_{b in B_i} P(b) ]. Intuitively, this shifts mass from the elements in the neighborhood (usually defined by transposing/deleting/replacing words or subsequences in the data) to the data points. You can make it efficient by exploiting properties of the neighborhoods to compute the (gradient of the) denominator with variations of sum-product or inside-outside, and then optimizing with whatever numeric gradient optimizer you want.

  • There's always EM, although EM is not always efficient for this sort of model. There is this approach, which uses an approximation of the denominator.

  • Not exactly MRF-related, but there is this Kirkpatrick et al paper that replaces the output distributions of latent variable (directed?) models with maximum entropy classifiers, and maximizes the probability either with a gradient approach or with EM. It seems to outperform Smith and Eisner's model (with Haghighi and Klein's features).
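To make the contrastive divergence recipe concrete, here is a minimal CD-1 sketch. The model is my own toy setup (a fully visible binary 3-node chain MRF with one agreement weight per edge), not Hinton's RBM setting; the point is just the update rule: the gradient estimate is f(data) minus f(reconstruction), where the reconstruction comes from one Gibbs sweep started at the data.

```python
import math
import random

# Toy fully visible binary MRF: a 3-node chain with one weight per edge,
# feature f_e(y) = 1[y_a == y_b].  (My own toy choice, just to show the update.)
EDGES = [(0, 1), (1, 2)]

def feats(y):
    return [1.0 if y[a] == y[b] else 0.0 for (a, b) in EDGES]

def gibbs_sweep(y, w, rng):
    """One systematic-scan Gibbs sweep: resample each node given the rest."""
    y = list(y)
    for i in range(len(y)):
        logits = [0.0, 0.0]  # unnormalized log-prob of y[i] = 0 and y[i] = 1
        for e, (a, b) in enumerate(EDGES):
            if i in (a, b):
                other = y[b] if i == a else y[a]
                logits[other] += w[e]  # agreeing with the neighbor earns w[e]
        p1 = 1.0 / (1.0 + math.exp(logits[0] - logits[1]))
        y[i] = 1 if rng.random() < p1 else 0
    return tuple(y)

rng = random.Random(0)
data = [(0, 0, 0), (1, 1, 1), (0, 1, 0)]
w = [0.0, 0.0]
for _ in range(2000):
    grad = [0.0, 0.0]
    for x in data:
        x_recon = gibbs_sweep(x, w, rng)  # CD-1: one Gibbs sweep from the data
        for k in range(len(w)):
            grad[k] += feats(x)[k] - feats(x_recon)[k]
    w = [wi + 0.02 * gi / len(data) for wi, gi in zip(w, grad)]
# the data agree on each edge 2/3 of the time, so both weights drift positive
```

Running more Gibbs steps (CD-k) reduces the bias of the gradient estimate at the cost of more sampling per update.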
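Here is a minimal sketch of Smith and Eisner's contrastive estimation on a toy log-linear model. The bigram indicator features and the adjacent-transposition (TRANS1-style) neighborhood are my simplifications of their setup; the key property the code shows is that the global partition function cancels, so each datum only competes against its own neighborhood.

```python
import math
from collections import Counter

def features(seq):
    # bigram indicator counts as log-linear features (my simplification;
    # Smith & Eisner use HMM-style features over latent tags)
    return Counter(zip(seq, seq[1:]))

def score(seq, w):
    return sum(w.get(f, 0.0) * c for f, c in features(seq).items())

def neighborhood(seq):
    # the datum itself plus all adjacent transpositions
    out = [tuple(seq)]
    for i in range(len(seq) - 1):
        s = list(seq)
        s[i], s[i + 1] = s[i + 1], s[i]
        out.append(tuple(s))
    return out

def ce_objective_and_grad(data, w):
    # sum_i [ log u(x_i) - log sum_{x' in N(x_i)} u(x') ] with u unnormalized:
    # the partition function cancels in the ratio, which is the whole point
    obj, grad = 0.0, Counter()
    for x in data:
        nb = neighborhood(x)
        scores = [score(n, w) for n in nb]
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        obj += score(x, w) - log_z
        grad.update(features(x))            # observed features
        for n, s in zip(nb, scores):        # minus expected features over N(x)
            p = math.exp(s - log_z)
            for f, c in features(n).items():
                grad[f] -= p * c
    return obj, grad

data = [("the", "dog", "barks"), ("the", "cat", "sleeps")]
w = {}
for _ in range(100):  # plain gradient ascent; the objective is concave in w
    obj, grad = ce_objective_and_grad(data, w)
    for f, g in grad.items():
        w[f] = w.get(f, 0.0) + 0.1 * g
```

After training, attested bigrams like ("the", "dog") get higher weight than their transposed neighbors, which is exactly the "shift mass from the neighborhood to the data" intuition.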

I assume there should also be sampling and variational approaches, but I'm not aware of them producing good results. Is there an obvious reason they shouldn't work, or are they difficult/expensive?

Does anyone know where to find links to code for these approaches?

Are there some techniques I'm missing?

This question is marked "community wiki".

asked Aug 13 '10 at 17:09

Alexandre Passos ♦

edited Dec 03 '10 at 07:14


2 Answers:

Let's say an MRF is a density over a vector y of the form P(y) propto exp(sum_{ij} f(y_i, y_j, w)), i.e., a fully observed MRF. Here's a rough overview of fitting such MRFs to data. Note that some of these approaches were introduced for CRFs, but just fix x during training and testing and you get an MRF.

  1. Maximize joint likelihood:

    A nice property of MRFs is that the joint likelihood is a convex function of the parameters, so there's a unique maximum, and you can find the best fit either by solving gradient = 0 exactly or numerically by gradient descent.

  2. Maximize marginal likelihood:

    Pointwise log-loss minimization ("Training Conditional Random Fields for Maximum Labelwise Accuracy", "Alternative objective for Markovian fields"). Notice that this objective couples the parameters even for tree-structured models, so you need to do gradient descent.

  3. Maximize some kind of margin?

    I would expect there to be approaches that do away with normalization altogether, akin to how Maximum Margin Markov Networks relate to CRFs, although I can't find any examples right now.
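To illustrate point 1, here is a tiny sketch of maximizing the joint likelihood of a fully observed MRF by exact gradient ascent. The model (a 3-node binary chain with agreement features, one weight per edge) is my own toy choice; the identity being used is the standard one, gradient of the log-likelihood = empirical E[f] minus model E[f], with the model expectation computed by brute-force enumeration since the state space is tiny.

```python
import itertools
import math

# Toy fully observed binary MRF on a 3-node chain with agreement features
# f_e(y) = 1[y_a == y_b]  (my choice of f, one weight per edge).
EDGES = [(0, 1), (1, 2)]

def feats(y):
    return [1.0 if y[a] == y[b] else 0.0 for (a, b) in EDGES]

def log_u(y, w):
    return sum(wi * fi for wi, fi in zip(w, feats(y)))

def loglik_and_grad(data, w):
    # exact gradient: empirical E[f] minus model E[f]
    configs = list(itertools.product([0, 1], repeat=3))
    m = max(log_u(y, w) for y in configs)
    log_z = m + math.log(sum(math.exp(log_u(y, w) - m) for y in configs))
    model_ef = [0.0] * len(EDGES)
    for y in configs:
        p = math.exp(log_u(y, w) - log_z)
        for k, fk in enumerate(feats(y)):
            model_ef[k] += p * fk
    ll, grad = 0.0, [0.0] * len(EDGES)
    for y in data:
        ll += log_u(y, w) - log_z
        for k, fk in enumerate(feats(y)):
            grad[k] += fk - model_ef[k]
    return ll, grad

data = [(0, 0, 0), (1, 1, 1), (0, 1, 0)]
w = [0.0, 0.0]
for _ in range(200):  # gradient ascent on the concave log-likelihood
    ll, grad = loglik_and_grad(data, w)
    w = [wi + 0.2 * gi for wi, gi in zip(w, grad)]
# at the optimum the gradient vanishes: model E[f] matches empirical E[f]
```

For this chain the two edge-agreement features are independent under the model, so each weight converges to log(2) (the data agree on each edge 2/3 of the time, and e^w/(1+e^w) = 2/3 gives w = log 2).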

There's also an approach that I haven't seen published and am personally fond of: take one of the recently discovered "deterministic fully polynomial-time approximation algorithms" for counting colorings, extend it to general Gibbs fields, and use it to compute a guaranteed approximation to the gradient of the log-likelihood. This paper gives the version that's most amenable to such an extension. I've actually worked out the math but don't have a good problem to apply it to; if anyone wants details, I'm happy to share.

This answer is marked "community wiki".

answered Aug 13 '10 at 18:00

Yaroslav Bulatov

edited Aug 14 '10 at 19:47

Thanks!

About joint likelihood: how do you define that for unsupervised models (i.e., where you don't have access to the y's but want to estimate them, as in unsupervised POS tagging)? I thought pseudo-likelihood was very closely related to conditional likelihood. The non-local contrastive objectives paper seems interesting; I'll read it.

The "Training conditional random fields for maximum labelwise accuracy" paper seems supervised, as does the "alternative objective" paper.

Yes, I was also looking for margin-based approaches, but it is kind of hard to visualize how they would work without supervision.

Would you care to explain a bit how colorings relate to learning in Gibbs fields?

(Aug 13 '10 at 18:28) Alexandre Passos ♦

I think the hardest problem to overcome with a margin-based approach to unsupervised learning of Markov random fields is deciding what to put on the other side of the margin. If you put all other examples there, it becomes about as expensive as EM; if you define a neighborhood, it gets very similar to Smith and Eisner. One-class SVMs separate points from the origin, so I guess that idea could be extended to the max-margin Markov setting, but it doesn't seem able to deal with latent variables, which is the setting I'm thinking about (i.e., it seems to work perfectly in the complete-data case, but not in the incomplete-data case).

(Aug 13 '10 at 18:40) Alexandre Passos ♦

The above are just general techniques for learning P(x) where P is modeled as a Markov field. As for how to apply them to POS tagging... well, if you don't have examples of POS tags, how can you ever learn to label them? Do you perhaps mean semi-supervised learning?

Both papers on pointwise log-likelihood are applied to supervised tasks, but they don't have to be: just take a dummy fixed x and then learn the parameters using their technique. It'll then give you a density over y's without dependence on any x.

Without labels, margin-based approaches need a new definition of "margin", for instance as in one-class SVMs.

I wrote an overview of the MRF-colorings connection on mathoverflow, but here's a brief summary. Suppose you sample uniformly from the set of proper colorings of a graph. This random variable is distributed as P(x) propto prod_{(i,j) in E} 1[x_i != x_j] (equivalently, the beta -> infinity limit of P(x) propto exp(-beta sum_{(i,j) in E} 1[x_i = x_j])), where the product is taken over all edges of the graph. Combinatorics people are really interested in counting the number of proper colorings, which is the same as computing the partition function of the distribution above, and there's an equation (Gamarnik calls it "Cavity Expansion") which lets you express the number of proper colorings in terms of marginals of distributions of this form. They came up with a novel family of algorithms for finding the marginal P(x_i = c) in that particular model, but these can be extended to general Gibbs fields.
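To make the colorings/partition-function identity concrete: for the uniform distribution over proper colorings, P(x) propto prod_{(i,j) in E} 1[x_i != x_j], the partition function is exactly the number of proper q-colorings. A brute-force check of that identity (the function name is mine; the Gamarnik-style algorithms exist precisely to approximate this quantity via marginals without enumeration):

```python
import itertools

def count_proper_colorings(n_nodes, edges, q):
    # Partition function of P(x) propto prod_{(i,j) in E} 1[x_i != x_j]:
    # summing the unnormalized weight over all q^n assignments counts
    # exactly the proper colorings.
    total = 0
    for x in itertools.product(range(q), repeat=n_nodes):
        if all(x[i] != x[j] for (i, j) in edges):
            total += 1
    return total

# triangle: all 3 nodes mutually adjacent, so proper 3-colorings = 3! orderings
print(count_proper_colorings(3, [(0, 1), (1, 2), (0, 2)], 3))          # prints 6
# 4-cycle, 3 colors: chromatic polynomial gives (q-1)^4 + (q-1) = 18
print(count_proper_colorings(4, [(0, 1), (1, 2), (2, 3), (3, 0)], 3))  # prints 18
```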

(Aug 13 '10 at 18:56) Yaroslav Bulatov

The traditional approach to learning POS tags without examples is to train a model with latent variables that looks like an HMM (label-label and label-word features, with observed words (and other features) and hidden labels) and then map the latent states to tags in some way (this is sort of cheating, yes, but some people try to minimize the cheating; see the Haghighi and Klein paper I linked in the question for more on this, including a strategy that leads to less cheating and more accurate estimates). The idea is to learn the distributional properties of POS tags, not necessarily their names (since that is the harder part of the problem).

If you're just learning an MRF with only observed variables (as vision people do, to later use without supervision for inpainting or whatever), the generalization of one-class SVMs to M³Ns is trivial, as is adapting Joachims' cutting-plane algorithm, I think. With hidden variables the task is harder, but maybe it can be generalized to an expectation-max-margin objective (i.e., minimize ||w|| - C/n sum_i max(E_y[w·phi(x_i)] - rho, 0), where the expectation is over the hidden variables and computed with sum-product). I'm not sure the cutting-plane algorithm works for this (especially since with that expectation the constraints might get ugly), but I guess it's worth looking at.

That's a cool connection between combinatorics and graphical models. I'll read your summary and the paper more carefully; it seems interesting.

(Aug 13 '10 at 19:25) Alexandre Passos ♦

BTW, I'm having some trouble getting the formatting to work... "-" looks like a bulleted list in the preview, but after committing, it shows up as a numbered list.

(Aug 14 '10 at 15:05) Yaroslav Bulatov

Also, it sounds like you are talking about hidden Markov random fields, which are distinct from MRFs and not log-linear (in an MRF, the log-odds of any two observations is a linear function of the parameters).

(Aug 14 '10 at 15:10) Yaroslav Bulatov

I mean MRFs with hidden variables (so given the complete data the unnormalized likelihood is log-linear, but the same is not true for incomplete data).

The formatting on this site is really crappy at the moment. I also have some issues with italics (they keep showing up whenever I want to type subscripts, and they behave differently in the preview and on the site).

(Aug 14 '10 at 15:27) Alexandre Passos ♦

I think I got the formatting right this time, though.

(Aug 14 '10 at 15:32) Alexandre Passos ♦

Your answer's not exactly off-topic; some of these methods can be adapted. The Vickrey one is especially relevant, and the pseudo-likelihood as well. I'd remove the notice.

(Aug 14 '10 at 18:48) Alexandre Passos ♦

For structured prediction tasks, there is the unsupervised version of the SEARN algorithm by Hal Daume.

This answer is marked "community wiki".

answered Aug 13 '10 at 19:25

spinxl39


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.