I've recently become interested in unsupervised training of log-linear models for NLP (and maybe other tasks), so I've decided to create this question to collect good papers and/or code pointers in this area. To get the ball rolling, I'm aware of:
I assume there should be sampling and variational approaches, but I'm not aware of them producing good results. Is there an obvious reason they shouldn't work, or are they difficult/expensive? Does anyone know where to find links to code for these approaches? Are there some techniques I'm missing?
Let's say an MRF is a density over a vector y of the form $P(y) \propto \exp(\sum_{ij} f(y_i, y_j, w))$, i.e., a fully observed MRF. Here's a rough overview of fitting such MRFs to data. Notice that some of the approaches were introduced for CRFs, but just fix x during training and testing and you get an MRF.
There's also an approach that I haven't seen published and am personally fond of: take one of the recently discovered "deterministic fully polynomial-time approximation algorithms" for counting colorings, extend it to general Gibbs fields, and use it to compute a guaranteed approximation to the gradient of the log-likelihood. This paper gives the version that's most amenable to such an extension. I've actually worked out the math but don't have a good problem to apply it to; if anyone wants the details, I'm happy to share.
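To make the gradient being approximated concrete, here is a minimal sketch (my own toy setup, not from the paper or the answer above): the exact maximum-likelihood gradient for a tiny fully observed binary pairwise MRF, computed by brute-force enumeration. The graph, the choice $f(y_i, y_j, w) = w_{ij}\, y_i y_j$, and the helper names are assumptions for illustration; the model expectation computed exactly here is the quantity that approximate methods (pseudo-likelihood, contrastive divergence, or a coloring-style approximation) would estimate on larger graphs.

```python
import itertools
import numpy as np

# Toy fully observed binary pairwise MRF:
#   P(y) ∝ exp( sum_{(i,j) in E} w_ij * y_i * y_j )
# Brute-force enumeration is only feasible for a handful of variables; the point
# is that the gradient = empirical feature counts - model expected feature counts.
edges = [(0, 1), (1, 2), (2, 0)]   # hypothetical toy graph
n_vars = 3

def pairwise_feats(y):
    """One y_i * y_j feature per edge."""
    return np.array([y[i] * y[j] for i, j in edges], dtype=float)

def log_lik_and_grad(w, data):
    """Exact average log-likelihood and its gradient."""
    configs = np.array(list(itertools.product([0, 1], repeat=n_vars)))
    feats = np.array([pairwise_feats(y) for y in configs])    # (2^n, |E|)
    scores = feats @ w
    log_Z = np.logaddexp.reduce(scores)
    probs = np.exp(scores - log_Z)                            # exact model distribution
    emp = np.mean([pairwise_feats(y) for y in data], axis=0)  # empirical counts
    model_exp = probs @ feats                                 # expected counts under the model
    ll = np.mean([pairwise_feats(y) @ w for y in data]) - log_Z
    return ll, emp - model_exp

# Gradient ascent on fabricated data, purely for illustration.
rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(50, n_vars))
w = np.zeros(len(edges))
for _ in range(200):
    ll, grad = log_lik_and_grad(w, data)
    w += 0.5 * grad
```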
Thanks! About joint likelihood: how do you define that for unsupervised models (i.e., where you don't have access to the y's but want to estimate them, like in unsupervised POS tagging)? I thought pseudo-likelihood was very closely related to conditional likelihood. The non-local contrastive objectives paper seems interesting; I'll read it. The "Training conditional random fields for maximum labelwise accuracy" paper seems supervised, as does the "alternate objective" paper. Yes, I was also looking for margin-based approaches, but it is kind of hard to visualize how they should work without supervision. Would you care to explain a bit how colorings relate to learning in Gibbs fields?
(Aug 13 '10 at 18:28)
Alexandre Passos ♦
I think the hardest problem to overcome with a margin-based approach to unsupervised learning of markov random fields is what to put on the other side of the margin. If you put all other examples, you become about as efficient as EM, if you define a neighborhood it gets very similar to Smith and Eisner. One-class SVM separates points from the origin, so I guess this idea could be extended to the max-margin-markov setting, but it doesn't seem able to deal with latent variables, which is the setting I'm thinking about (ie, it seems to work perfectly in the complete-data case, but not in the incomplete-data case).
(Aug 13 '10 at 18:40)
Alexandre Passos ♦
Above are just general techniques for learning P(X) where P is modelled as a Markov field. As to how to apply them to POS tagging... well... if you don't have examples of POS tags, how can you ever learn to label them? Do you perhaps mean semi-supervised learning? Both papers on pointwise log-likelihood are applied to supervised tasks, but they don't have to be: just take a dummy fixed x and then learn the parameters using their technique. It'll then give you a density over y's without dependence on any x. Without labels, margin-based approaches need a new definition of "margin", for instance as in one-class SVMs. I wrote an overview of the MRF-colorings connection on MathOverflow, but here's a brief summary: suppose you sample uniformly from the set of proper colorings of a graph. This random variable is distributed as $P(x) \propto \prod_{(i,j)} I(x_i \ne x_j)$, where the product is taken over all the edges of the graph. Combinatorics people are really interested in counting the number of proper colorings, which is the same as the partition function of the distribution above, and there's an equation (Gamarnik calls it "Cavity Expansion") which lets you express the number of proper colorings in terms of marginals of the distribution P(x). They came up with a novel family of algorithms for finding the marginal P(x_i = c) in that particular model, but these can be extended to general Gibbs fields.
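(To illustrate the counting connection with a toy I made up, not something taken from Gamarnik's paper: for the uniform distribution over proper q-colorings, the partition function is literally the number of proper colorings, which is what the marginal-based algorithms estimate without enumerating all q^n assignments. The function name and the triangle graph below are just illustrative.)

```python
import itertools

# Uniform distribution over proper q-colorings of a graph:
#   P(x) ∝ prod_{(i,j) in E} I(x_i != x_j)
# Its partition function Z is exactly the number of proper colorings.
def count_proper_colorings(n_vars, edges, q):
    """Brute-force Z = sum over all assignments of the product of edge indicators."""
    return sum(
        all(x[i] != x[j] for i, j in edges)
        for x in itertools.product(range(q), repeat=n_vars)
    )

# Triangle with 3 colors: 3 * 2 * 1 = 6 proper colorings.
print(count_proper_colorings(3, [(0, 1), (1, 2), (2, 0)], q=3))  # -> 6
```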
(Aug 13 '10 at 18:56)
Yaroslav Bulatov
The traditional approach to learning POS tags without examples is to learn a model with latent variables that looks like an HMM (label-label features and label-word features, with observed words (and other features) and hidden labels) and then label the latent variables in some way (this is sort-of cheating, yes, but some people try to minimize the cheating; see the Haghighi and Klein paper I linked in the question for more comments on this, including a strategy that leads to less cheating and more accurate estimates). The idea is to learn the distributional properties of POS tags, not necessarily their names, since that is the harder part of the problem; see the sketch below. If you're just learning an MRF with only observed variables (like vision people do, to later use without supervision for inpainting or whatever), the generalization of one-class SVM to M³Ns is trivial, as is adapting the Joachims cutting-plane algorithm, I think. With hidden variables the task is harder, but maybe it can be generalized to an expectation-max-margin thing (i.e., minimize $\|w\| - \frac{C}{n}\sum \max(E_y[w \cdot \phi(x)] - p, 0)$, where the expectation is over the hidden variables and computed with sum-product). I'm not sure the cutting-plane algorithm works for this (especially since with that expectation the constraints might get too ugly), but I guess this is worth looking at. That's a cool connection between combinatorics and graphical models. I'll read your summary and the paper more carefully; it seems interesting.
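Here is that latent-variable setup as a minimal, hedged sketch (my own construction, not from any of the papers mentioned): a categorical HMM over word ids with hidden tag states, where the sum-product/forward-backward pass gives the posterior tag marginals that an EM-style M-step (not shown) would use to re-estimate the transition and emission tables. The function name, the scaling scheme, and the toy numbers are all assumptions; the learned states are distributional tag clusters with arbitrary names, which is exactly the point about names being the hard part.

```python
import numpy as np

def forward_backward(obs, pi, A, B):
    """Posterior marginals over hidden tags for one sentence (scaled alpha/beta recursions).

    obs: list of word ids; pi: (K,) initial tag probs;
    A: (K, K) tag-tag transitions; B: (K, V) tag-word emissions.
    """
    T, K = len(obs), len(pi)
    alpha = np.zeros((T, K))
    beta = np.ones((T, K))
    scale = np.zeros(T)
    alpha[0] = pi * B[:, obs[0]]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):                                  # forward pass
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]
    for t in range(T - 2, -1, -1):                         # backward pass
        beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / scale[t + 1]
    gamma = alpha * beta                                   # posterior tag marginals
    return gamma / gamma.sum(axis=1, keepdims=True)

# Illustration with random parameters: 3 latent "tags", vocabulary of 10 word types.
rng = np.random.default_rng(1)
K, V = 3, 10
pi = np.full(K, 1.0 / K)
A = rng.dirichlet(np.ones(K), size=K)    # tag-tag transitions
B = rng.dirichlet(np.ones(V), size=K)    # tag-word emissions
sentence = [0, 3, 7, 3, 1]               # word ids of a fake sentence
print(forward_backward(sentence, pi, A, B))
```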
(Aug 13 '10 at 19:25)
Alexandre Passos ♦
BTW, I'm having some trouble getting the formatting to work... "-" looks like a bulleted list in the preview, but after committing, it shows up as a numbered list.
(Aug 14 '10 at 15:05)
Yaroslav Bulatov
Also, it sounds like you are talking about hidden Markov random fields, which are distinct from MRFs and are not log-linear (in an MRF, the log-odds of any two observations is a linear function of the parameters).
(Aug 14 '10 at 15:10)
Yaroslav Bulatov
I mean MRFs with hidden variables (so when you have the complete data the unnormalized likelihood is log-linear, but the same is not true for incomplete data). The formatting on this site is really crappy at the moment. I also have some issues with italics (they keep showing up whenever I want to type subscripts, and they behave differently in the preview and on the site).
(Aug 14 '10 at 15:27)
Alexandre Passos ♦
I think I got the formatting right this time, though.
(Aug 14 '10 at 15:32)
Alexandre Passos ♦
Your answer's not exactly off-topic; some of these methods can be adapted. The Vickrey one is especially relevant, and the pseudo-likelihood as well. I'd remove the notice.
(Aug 14 '10 at 18:48)
Alexandre Passos ♦
For structured prediction tasks, there is the unsupervised version of the SEARN algorithm by Hal Daume.