|
Usually when DP and HDP models are discussed in NLP, it's to allow an arbitrary number of classes/clusters/states, but I think the same thing should apply to words. Let's say we're doing document classification with naive Bayes. The model looks like
But this feels slightly weird, since the total number of words affects the smoothing a lot. In this model, the probability of a word being generated in a document in a class is (C_w+alpha)/(C_class + alpha*V), where C_w is the count of the word in the class, C_class is the total word count of the class, and V is the vocabulary size. So the smoothing is a lot stronger in classes with few observed word types, especially if you have a huge vocabulary. Why not replace it with something like
(here each word is represented as a unique real number just to make sure that the probability of two different samples from the word distribution being the same is 0; this can be ignored when implementing it)

Now the probability of observing a word in a class is proportional to C_w/(N_c + alpha) if it has been observed in that class, and proportional to alpha/(N_c+alpha) if it hasn't. This feels a lot cleaner, since a class doesn't automatically waste a lot of its probability mass on words that are never going to be allocated to it.

Two issues with this model are that, due to the hierarchical part, different words will have different probabilities (and if you remove the HDP aspect you get a very unrealistic model, since documents in different classes could never share words as a matter of principle), and that it should maybe be a Pitman-Yor process to account for power-law behavior.

So it's easier to estimate probabilities in this model (at least if you're Gibbs sampling), and it doesn't have an odd normalization. Why don't more people use this as the standard way of doing Dirichlet-multinomial things for NLP problems? This would also naturally extend to the online setting, where you don't know how many words will be observed in the corpus.
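To make the contrast concrete, here is a minimal Python sketch of the two estimates for a single class. It is only an illustration under some assumptions: it treats N_c as the same quantity as C_class (the total word count of the class), it uses a plain single-level DP rather than the hierarchical version (so it ignores the sharing of words across classes), and the function names, the toy counts, and the values of alpha and V are all made up for the example.

```python
from collections import Counter


def additive_prob(word, counts, alpha, vocab_size):
    """Dirichlet-multinomial estimate: (C_w + alpha) / (C_class + alpha * V)."""
    total = sum(counts.values())
    return (counts[word] + alpha) / (total + alpha * vocab_size)


def additive_unseen_mass(counts, alpha, vocab_size):
    """Total mass the additive estimate gives to word types unseen in the class."""
    total = sum(counts.values())
    seen_types = sum(1 for c in counts.values() if c > 0)
    return alpha * (vocab_size - seen_types) / (total + alpha * vocab_size)


def dp_seen_prob(word, counts, alpha):
    """Single-level DP predictive for an already observed word: C_w / (N_c + alpha)."""
    total = sum(counts.values())
    return counts[word] / (total + alpha)


def dp_new_word_mass(counts, alpha):
    """Single-level DP predictive mass for any previously unseen word: alpha / (N_c + alpha)."""
    total = sum(counts.values())
    return alpha / (total + alpha)


# Hypothetical toy class with only four observed tokens (assumed values).
toy_counts = Counter({"goal": 3, "match": 1})
ALPHA = 0.5
VOCAB_SIZE = 10_000

if __name__ == "__main__":
    print("additive mass on unseen types:",
          additive_unseen_mass(toy_counts, ALPHA, VOCAB_SIZE))
    print("DP mass on a new word:", dp_new_word_mass(toy_counts, ALPHA))
```

With these toy numbers the additive estimate leaves roughly 0.999 of the class's probability mass on word types it has never seen, while the DP predictive leaves about 0.11 for a new word, and that amount does not depend on V at all.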
|
There has been some work in the past that assumes the number of words is infinite as well. For example, Goldwater et al. (2006)'s NIPS paper on type-token relationships uses Pitman-Yor processes on words: http://cocosci.berkeley.edu/tom/papers/typetoken.pdf . My best guess for why most people don't use it is that, in practice, using a large vocabulary size vs. an infinite vocabulary size turns out not to make much of a difference, and in general people would rather use a finite model than an infinite one because infinity is scary.

Thanks for the reference, this seems to be the sort of thing I was looking for. I agree that if your vocabulary is fixed it shouldn't matter much (except in the naive Bayes case, where the infinite model lets you make new distinctions between high-entropy and low-entropy classes, but I'm not sure it should matter much, if at all).
(Sep 26 '10 at 15:46)
Alexandre Passos ♦
|
|
Actually this has been done (and better than I would have imagined) in Hardisty, Boyd-Graber, and Resnik, Modeling Perspective using Adaptor Grammars, in EMNLP 2010. The main difference (which makes their work better) is that they use adaptor grammars instead of a shallow DP, in one stroke allowing collocations and an unlimited vocabulary size that is different for each class.

Interesting, I'll have to skim this.
(Oct 20 '10 at 20:55)
Art Munson
|
I guess I'm a bit confused by your model above. If you have a DP generating words, then you have an infinite number of possible words. Doesn't this mean you always have the possibility of generating a previously unseen word? How do you handle this situation?
In actual NLP you're always coming upon new words; I know of no complete corpus of any language that includes all made-up nouns, proper nouns, made-up adjectives, archaic forms, misspellings, etc.
I'm interested in how this works in practice. Alex, you might get more "answers" if you make it clearer exactly what your question is.
@Art Munson: my question is that I always see DPs and HDPs used to deal with the arbitrary number of classes problem, but I think they actually seem a better fit for the arbitrary number of words problem in NLP, and I find it really odd that nobody does this already, so I want to know if there's any work in this direction or any obvious roadblocks.