Nonparametric Bayesian models have been shown to perform well on sequence labeling tasks such as unsupervised part-of-speech tagging (Goldwater, 2007), among others. Now I have two questions:

  • I'm wondering how the performance of these Bayesian models compares to unsupervised log-linear models (e.g. contrastive estimation), where you can often get very high performance through the addition of arbitrary features.

  • Is there a way to incorporate the features we use in a log-linear model into a Bayesian nonparametric model?

asked Jul 06 '10 at 10:49 by Frank
retagged Jul 06 '10 at 11:05 by Jurgen


4 Answers:

The short answer is that it's quite complicated to do so. The reason is that most nonparametric Bayesian models, and Bayesian models in general, are substantially easier to learn and perform inference in if a conjugate prior is used. The conjugate prior to the standard multinomial distribution is the Dirichlet. For the Dirichlet process, marginalizing out the parameters to obtain a clustering posterior relies on exactly this property.
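To make the conjugacy point concrete, here's a minimal sketch (Python/NumPy, with made-up counts) of the collapsed Dirichlet-multinomial predictive that a collapsed Gibbs sampler for a Dirichlet process mixture relies on; the closed form only exists because the Dirichlet prior is conjugate to the multinomial:

    import numpy as np

    # Collapsed Dirichlet-multinomial predictive: because the Dirichlet prior is
    # conjugate to the multinomial, the emission parameters can be integrated out
    # analytically, leaving a simple ratio of counts:
    #   p(x = v | data in cluster k) = (n_{k,v} + alpha_v) / (n_k + sum_v alpha_v)

    def collapsed_predictive(counts, alpha):
        """counts[v] = times symbol v was emitted by this cluster so far;
        alpha[v]  = Dirichlet pseudo-count for symbol v."""
        return (counts + alpha) / (counts.sum() + alpha.sum())

    # Toy example with a 4-symbol vocabulary (hypothetical numbers).
    counts = np.array([3.0, 0.0, 1.0, 0.0])
    alpha = np.full(4, 0.5)                     # symmetric Dirichlet(0.5) prior
    print(collapsed_predictive(counts, alpha))  # predictive over the 4 symbols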

If you have a locally-normalized distribution which uses arbitrary log-linear features, as in the paper Alex mentioned (BTW, that model isn't really Bayesian), that emission distribution isn't conjugate to anything we know, so most of the tools in the Bayesian toolbox don't work here.
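For contrast, here's roughly what a locally-normalized log-linear emission looks like, as a sketch with two invented features; once the emission takes this softmax form there is no conjugate prior for the weights w, so you can't integrate them out analytically the way you can with the Dirichlet-multinomial:

    import numpy as np

    def loglinear_emission(w, features, vocab):
        """p(x | tag) proportional to exp(w . f(x)), normalized locally over the vocabulary.
        `features(x)` returns a feature vector for emitting word x from this tag."""
        scores = np.array([w @ features(x) for x in vocab])
        scores -= scores.max()              # for numerical stability
        probs = np.exp(scores)
        return probs / probs.sum()

    # Toy setup (hypothetical): features are "word is capitalized" and "word ends in -ing".
    vocab = ["running", "Paris", "dog"]
    features = lambda x: np.array([x[0].isupper(), x.endswith("ing")], dtype=float)
    w = np.array([0.5, 2.0])                # log-linear weights; no conjugate prior exists
    print(loglinear_emission(w, features, vocab))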

This isn't to say it's impossible, but there's a reason it's hard to see how it could work. It is possible to go naive-Bayes and just emit features independently for each cluster, using Beta or Dirichlet priors on the feature distributions, but I was assuming you meant features in the log-linear style that's common now.
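Here's a minimal sketch of that naive-Bayes route (hypothetical binary features and counts): each feature gets its own Beta prior and can be collapsed independently, because the Bernoulli likelihood is conjugate to the Beta:

    import numpy as np

    def naive_bayes_predictive(on_counts, off_counts, a=1.0, b=1.0):
        """Per-feature collapsed Beta-Bernoulli predictive for one cluster:
        p(feature is on) = (n_on + a) / (n_on + n_off + a + b), independently per feature."""
        return (on_counts + a) / (on_counts + off_counts + a + b)

    # Toy counts for 3 binary features in one cluster (hypothetical numbers).
    on_counts = np.array([5.0, 0.0, 2.0])
    off_counts = np.array([1.0, 6.0, 4.0])
    print(naive_bayes_predictive(on_counts, off_counts))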

answered Jul 06 '10 at 14:35 by aria42

There's a paper at this year's NAACL that incorporates arbitrary features into parametric unsupervised Bayesian models. However, I think you should be able to use a nonparametric model in that same setting with one of the popular truncated approximations.
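By truncated approximations I mean things like truncated stick-breaking for the Dirichlet process; a minimal sketch (NumPy, with an arbitrary truncation level K) looks like this:

    import numpy as np

    def truncated_stick_breaking(alpha, K, rng=np.random.default_rng(0)):
        """Draw mixture weights from a DP(alpha) truncated to K components:
        beta_k ~ Beta(1, alpha), pi_k = beta_k * prod_{j<k} (1 - beta_j),
        with the last stick forced to absorb the remaining mass."""
        betas = rng.beta(1.0, alpha, size=K)
        betas[-1] = 1.0                       # truncation: use up all remaining mass
        remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas[:-1])])
        return betas * remaining

    pi = truncated_stick_breaking(alpha=1.0, K=20)
    print(pi.sum())   # the K weights sum to 1 by construction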

answered Jul 06 '10 at 11:15 by Alexandre Passos ♦

For the particular problem of unsupervised part-of-speech tagging, there are two ways you can go about integrating features into the model (whether you use a parametric or non-parametric model):

  • as extra input features: you can condition the Markov chain on input features. The number of transition parameters in the model increases, and you're going to have to come up with a clever hierarchical Bayes prior to work around that (see the sketch after this list). We've done an extension of the iHMM with input features, called the IO-iHMM, in analogy to the equivalent parametric counterpart (the IO-HMM).
  • as extra output features: again, the number of parameters in the output distribution is going to increase. Nonetheless, you can get better predictions if there is signal in those extra features.
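To give a rough feel for the first option, here's a toy sketch (not the actual IO-iHMM, just a hypothetical parametric stand-in) of a transition distribution conditioned on a discrete input feature; each input value gets its own transition row, which is exactly the parameter blow-up a hierarchical prior has to tie together:

    import numpy as np

    # Sketch of an IO-HMM-style transition: p(next state | current state, input feature).
    # With S hidden states and F input-feature values there are S * F transition rows
    # instead of S, which is why tying them with a shared parent distribution helps.
    S, F = 5, 3                                        # hypothetical sizes
    rng = np.random.default_rng(0)
    shared_base = rng.dirichlet(np.ones(S))            # shared "parent" transition distribution
    transitions = rng.dirichlet(10.0 * shared_base, size=(S, F))  # rows drawn around the parent

    def transition_prob(state, input_feature, next_state):
        return transitions[state, input_feature, next_state]

    print(transition_prob(state=0, input_feature=2, next_state=1))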

To me this smells a bit like generative vs. discriminative and I'd love to see a comparison of the two for an NLP task. I have to admit I don't know how these models compare to unsupervised log-linear models. Hopefully someone else has a good answer for that.

PS: (Goldwater, 2007) was a parametric unsupervised HMM. This paper describes a nonparametric version of the same model.

answered Jul 06 '10 at 11:13 by Jurgen

To answer the second part of your question, you can use any features that you can build a likelihood model for. That is, as long as you can compute P(observation | parameters), they'll fit in the model. You should also have a prior distribution over your parameters, P(parameters), which can be somewhat arbitrary and is often chosen for computational convenience.
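As a toy illustration of that recipe (a hypothetical Gaussian likelihood and prior, nothing more), all the inference machinery needs from your features is a computable log P(observation | parameters) plus a log P(parameters):

    import numpy as np
    from scipy.stats import norm

    # Toy example: observations are real-valued feature values, the "parameter" is a mean.
    def log_likelihood(observations, mu):
        return norm.logpdf(observations, loc=mu, scale=1.0).sum()   # log P(obs | parameters)

    def log_prior(mu):
        return norm.logpdf(mu, loc=0.0, scale=10.0)                 # log P(parameters)

    def log_joint(observations, mu):
        # Unnormalized log posterior; this is all an MCMC or variational method needs.
        return log_likelihood(observations, mu) + log_prior(mu)

    obs = np.array([1.2, 0.7, 1.9])
    print(log_joint(obs, mu=1.0))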

answered Jul 06 '10 at 11:07 by Noel Welsh
