Nonparametric Bayesian models have been shown to perform well for sequence-labeling tasks such as unsupervised part-of-speech tagging (Goldwater, 2007) and others. Now I have two questions: is it possible to incorporate arbitrary (e.g., log-linear) features into such models, and what kinds of features can actually be used?
The short answer is that it's quite complicated to do so. The reason is that most nonparametric Bayesian models (and Bayesian models in general) are substantially easier to learn and perform inference in if a conjugate prior is used. The conjugate prior to the standard multinomial distribution is the Dirichlet. For the Dirichlet process, marginalizing out the parameters to obtain a clustering posterior relies on this property. If you have a locally normalized distribution that uses arbitrary log-linear features, like this paper that Alex mentioned (by the way, that model isn't really Bayesian), that emission distribution isn't conjugate to anything we know, so most of the tools in the Bayesian toolbox don't work here. This isn't to say it's impossible, but there's a reason it's hard to see how it can work. It is possible to go naive Bayes and just emit features independently for each cluster, using Beta or Dirichlet priors on the feature distributions, but I was assuming you meant features in the log-linear style that is common now.
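To make the contrast concrete, here is a minimal Python sketch (function names and toy numbers are mine, purely illustrative): a collapsed Dirichlet-multinomial emission, where conjugacy gives closed-form predictive probabilities, versus the naive-Bayes-style alternative of emitting binary features independently under per-feature Beta priors.

```python
import numpy as np

# Dirichlet-multinomial conjugacy: with a symmetric Dirichlet(alpha) prior on a
# cluster's emission distribution, the posterior predictive probability of the
# next symbol is just smoothed counts -- no numerical optimization needed.
def collapsed_emission_prob(symbol, counts, alpha):
    """P(next symbol | symbols already assigned to this cluster)."""
    return (counts[symbol] + alpha) / (counts.sum() + alpha * len(counts))

# Naive-Bayes-style alternative: emit each binary feature independently, with a
# Beta(a, b) prior per (cluster, feature). Each feature keeps its own conjugate
# posterior, so collapsed inference still goes through.
def collapsed_feature_prob(feature_vector, on_counts, total, a=1.0, b=1.0):
    """P(binary feature vector | sufficient statistics of this cluster)."""
    p_on = (on_counts + a) / (total + a + b)  # per-feature posterior Bernoulli mean
    return np.prod(np.where(feature_vector == 1, p_on, 1.0 - p_on))

# Toy usage: a 4-symbol vocabulary and 3 binary features.
counts = np.array([5.0, 2.0, 0.0, 1.0])
print(collapsed_emission_prob(2, counts, alpha=0.5))
feats = np.array([1, 0, 1])
print(collapsed_feature_prob(feats, on_counts=np.array([3.0, 1.0, 4.0]), total=8.0))
```

A log-linear emission has no such closed form: its normalizer couples all the features, which is exactly why the conjugate machinery above stops applying.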
There's a paper at this year's NAACL that incorporates arbitrary features into parametric unsupervised Bayesian models. However, I think you should be able to use a nonparametric model in that same setting with one of the popular truncated approximations.
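As a rough sketch of what such a truncated approximation looks like (the function name and parameters below are illustrative, not taken from that paper), truncation caps the Dirichlet process at a fixed number of components, after which the model is effectively parametric and the same feature machinery can be applied:

```python
import numpy as np

# Truncated stick-breaking approximation to a Dirichlet process: cap the number
# of mixture components at K and let the final break take all remaining mass.
def truncated_stick_breaking(alpha, K, rng=None):
    rng = rng or np.random.default_rng()
    betas = rng.beta(1.0, alpha, size=K)
    betas[-1] = 1.0  # truncation: the last stick absorbs whatever mass is left
    leftover = np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    return betas * leftover  # K weights summing to 1

weights = truncated_stick_breaking(alpha=1.0, K=20)
print(weights.sum())  # 1.0 up to floating-point error
```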
For the particular problem of unsupervised part-of-speech tagging, there are two ways you can go about integrating features into the model (whether you use a parametric or nonparametric model).
To me this smells a bit like generative vs. discriminative, and I'd love to see a comparison of the two for an NLP task. I have to admit I don't know how these models compare to unsupervised log-linear models. Hopefully someone else has a good answer for that. P.S.: (Goldwater, 2007) was a parametric unsupervised HMM. This paper describes a nonparametric version of the same model.
To answer the second part of your question, you can use any features that you can build a likelihood model for. That is, as long as you can compute P(observation | parameters), they will fit in the model. You should also have a prior distribution over your parameters, P(parameters), which can be somewhat arbitrary and is often chosen for computational convenience.
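For concreteness, here is a tiny sketch of that recipe in Python; the Gaussian likelihood and Normal/Gamma priors are placeholders standing in for whatever feature likelihood you actually build, not a recommendation:

```python
import numpy as np
from scipy.stats import norm, gamma

# The only requirements are a likelihood P(observation | parameters) you can
# evaluate and a prior P(parameters); everything else follows from Bayes' rule.
def log_posterior_unnormalized(mu, sigma, observations):
    log_prior = norm.logpdf(mu, loc=0.0, scale=10.0) + gamma.logpdf(sigma, a=2.0, scale=1.0)
    log_likelihood = norm.logpdf(observations, loc=mu, scale=sigma).sum()
    return log_prior + log_likelihood  # proportional to log P(parameters | observations)

obs = np.array([0.3, -0.1, 0.8, 0.2])
print(log_posterior_unnormalized(0.0, 1.0, obs))
```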