|
I read in Teh's encyclopedia article about Dirichlet processes that some people use a parametric base distribution in a nonparametric model. How does that work exactly? You just smooth the parametric model with a kernel method? What's a good reference for this? What if you had a multinomial classification task and you had a good previously trained distribution over n classes, and you wanted to extend that to infinitely many classes, using a nonparametric model? Would you just open a different Chinese Restaurant at each of the known classes? |
|
The Dirichlet process (which is what he was talking about; you meant the Applications section, right?) has two parameters: a base distribution and a concentration parameter. Samples from the DP, when averaged, approach the base distribution in the limit (just as samples from a Dirichlet distribution, on average, approach its base discrete probability measure).

For example, in a very simplified DP mixture model, you can sample each point in the dataset from a Gaussian whose mean is sampled from a DP, with another Gaussian as the base distribution, as in:

    m_i ~ DP(N(0, 10), alpha)
    p_i ~ N(m_i, 1)

(Strictly, this is shorthand for G ~ DP(N(0, 10), alpha) followed by m_i ~ G.) What this means is that the points are going to be grouped in normally distributed clusters, and the means of these clusters are themselves normally distributed. Hence the first normal distribution (the distribution of the means) is the base measure of the DP. The DP is what makes it possible for two or more points to share the same mean, which would have probability zero if all the means were sampled independently from a Gaussian. Is this clear? The DP acts like a "concentrator" on its base distribution, sampling from it but also remembering and reusing past samples.

As for your second question, your original model would probably be

    theta_i ~ Dirichlet(alpha)        ; the multinomial distribution of each class
    eta ~ Dirichlet(beta)             ; the prior over classes
    pc_j ~ Categorical(eta)           ; the class of a document
    d_j ~ Multinomial(theta_{pc_j})   ; the words of each document

Now, if you want a Dirichlet process with infinitely many classes, each with its word probabilities sampled from a Dirichlet distribution, you have

    class_j ~ DP(Dirichlet(alpha), beta)   ; alpha is the concentration of the word distribution, beta is the concentration of the class distribution
    words_j ~ Multinomial(class_j)

Interestingly enough, the DP model is usually simpler than the original one.
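To make the "remembering and reusing past samples" behaviour concrete, here is a minimal sketch of both models using the Polya urn (Chinese restaurant) view of the DP, with only the Python standard library. The function names `sample_from_dp` and `dirichlet_draw` are made up for illustration, the 10 in N(0, 10) is assumed to be a standard deviation, and the concentration values, vocabulary size and document length are arbitrary choices, not anything from Teh's article.

```python
import random

def sample_from_dp(n, concentration, base_draw, rng):
    """Draw n values from G ~ DP(base, concentration) via the Polya urn.

    Draw number i (0-based) is a fresh sample from the base distribution
    with probability concentration / (concentration + i); otherwise it
    repeats one of the previous i draws, chosen uniformly at random, so a
    value is reused in proportion to how often it has already appeared.
    """
    draws = []
    for i in range(n):
        if rng.random() < concentration / (concentration + i):
            draws.append(base_draw(rng))     # open a new cluster
        else:
            draws.append(rng.choice(draws))  # reuse a past sample
    return draws

def dirichlet_draw(alpha_vec, rng):
    """Sample from Dirichlet(alpha_vec) by normalising gamma variates."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha_vec]
    total = sum(gammas)
    return tuple(g / total for g in gammas)  # tuple, so it is hashable

rng = random.Random(0)

# Model 1: m_i ~ DP(N(0, 10), alpha), p_i ~ N(m_i, 1)
# Assumption: 10 is treated as a standard deviation here.
means = sample_from_dp(200, 1.0, lambda r: r.gauss(0.0, 10.0), rng)
points = [rng.gauss(m, 1.0) for m in means]

# Model 2: class_j ~ DP(Dirichlet(alpha), beta), words_j ~ Multinomial(class_j)
# Assumption: a toy vocabulary of 5 words and 20 words per document.
word_alpha = [1.0] * 5
class_dists = sample_from_dp(50, 1.0, lambda r: dirichlet_draw(word_alpha, r), rng)
docs = [rng.choices(range(5), weights=theta, k=20) for theta in class_dists]

print("distinct cluster means:", len(set(means)), "out of", len(means))
print("distinct classes:", len(set(class_dists)), "out of", len(class_dists))
```

The number of distinct values grows slowly with the number of draws, and raising the concentration parameter makes new clusters (or classes) appear more often. This only samples from the prior; fitting such a model to data needs an inference procedure such as Gibbs sampling, which is not shown here.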
You can also use the HDP model, in which the base measure of the DP is itself a DP, as in HDP-LDA (Teh talks about it in his HDP tutorial). Is this clear?

This is one of the clearest explanations (in a very short space) that I've seen on this topic. Nice job, Alexandre.
(Sep 18 '11 at 22:15)
Will Darling
|
|
This paper might be useful -- "Construction of Nonparametric Bayesian Models from Parametric Bayes Equations." P Orbanz, NIPS 2009. |