I read in Teh's encyclopedia article about Dirichlet processes that some people use a parametric base distribution in a nonparametric model. How does that work exactly? You just smooth the parametric model with a kernel method? What's a good reference for this?

What if you had a multinomial classification task and you had a good previously trained distribution over n classes, and you wanted to extend that to infinitely many classes, using a nonparametric model? Would you just open a different Chinese Restaurant at each of the known classes?

asked Jul 10 '10 at 20:38

Frank


2 Answers:

The Dirichlet process (which is what he was talking about; you meant the Applications section, right?) has two parameters: a base distribution and a concentration parameter. Averaged over many draws, samples from the DP approach the base distribution in the limit (just as samples from a Dirichlet distribution, on average, approach its base discrete probability measure). For example, in a very simplified DP mixture model, you can sample each point in the dataset from a Gaussian whose mean comes from a DP that has another Gaussian as its base distribution, as in:

G ~ DP(N(0, 10), alpha) ; G is a random distribution over means

m_i ~ G ; the mean of point i

p_i ~ N(m_i, 1)

What this means is that the points are grouped in normally distributed clusters, and the means of these clusters are themselves normally distributed. Hence the first normal distribution (the distribution of the means) is the base measure of the DP. The DP is what makes it possible for two or more points to share the same mean, which would have probability zero if all the means were sampled independently from a Gaussian.

Is this clear? The DP acts like a "concentrator" on its base distribution, sampling from it but also remembering and reusing past samples.
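To make the "remembering and reusing" behaviour concrete, here is a minimal Python sketch that draws from the toy mixture above using the Chinese restaurant process view of the DP. The function name sample_dp_mixture and its arguments are made up for illustration, and it assumes the 10 in the base distribution is a standard deviation (the notation above doesn't say whether it is a variance).

```python
import random

def sample_dp_mixture(n, alpha, base_sd=10.0, noise_sd=1.0, seed=0):
    """Draw n points from the toy DP mixture via the Chinese restaurant process.

    Assumption: base_sd=10.0 reads the 10 in N(0, 10) as a standard deviation.
    """
    rng = random.Random(seed)
    means = []   # m_i for each point; repeated values mark a shared cluster
    points = []  # p_i ~ N(m_i, noise_sd)
    for i in range(n):
        # With probability alpha / (alpha + i), draw a fresh mean from the base
        # distribution; otherwise reuse the mean of a uniformly chosen earlier
        # point, so popular means become even more popular.
        if rng.random() < alpha / (alpha + i):
            m = rng.gauss(0.0, base_sd)
        else:
            m = rng.choice(means)
        means.append(m)
        points.append(rng.gauss(m, noise_sd))
    return means, points

means, points = sample_dp_mixture(n=200, alpha=1.0)
print(len(set(means)), "distinct cluster means among", len(means), "points")
```

A larger alpha makes fresh draws from the base distribution more frequent, so you get more clusters; a small alpha concentrates the points on a handful of means.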

As for your second question, your original model would probably be

theta_i ~ Dirichlet(alpha) ; the multinomial distribution of each class

eta ~ Dirichlet(beta) ; the prior over classes

pc_j ~ Categorical(eta) ; the class of a document

d_j ~ Multinomial(theta_{pc_j}) ; the words of each document
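As a rough illustration, the finite model above could be simulated like this in Python. The helper names (dirichlet, sample_finite_model), the use of symmetric Dirichlet priors, and the fixed document length are all assumptions made for the sketch, not part of the model as stated.

```python
import random

def dirichlet(rng, conc, dim):
    """Sample a symmetric Dirichlet(conc) vector of length dim via normalized gammas."""
    g = [rng.gammavariate(conc, 1.0) for _ in range(dim)]
    total = sum(g)
    return [x / total for x in g]

def sample_finite_model(n_docs, n_classes, vocab_size, doc_len, alpha, beta, seed=0):
    """Generate (class, words) pairs from the finite class model."""
    rng = random.Random(seed)
    # theta_i ~ Dirichlet(alpha): one word distribution per class
    theta = [dirichlet(rng, alpha, vocab_size) for _ in range(n_classes)]
    # eta ~ Dirichlet(beta): the prior over classes
    eta = dirichlet(rng, beta, n_classes)
    docs = []
    for _ in range(n_docs):
        # pc_j ~ Categorical(eta): the class of the document
        pc = rng.choices(range(n_classes), weights=eta)[0]
        # d_j ~ Multinomial(theta_{pc_j}): the words, as a list of word ids
        words = rng.choices(range(vocab_size), weights=theta[pc], k=doc_len)
        docs.append((pc, words))
    return docs

docs = sample_finite_model(n_docs=20, n_classes=3, vocab_size=5, doc_len=8,
                           alpha=1.0, beta=1.0)
print(docs[0])
```

Note that the number of classes has to be fixed up front here, which is exactly what the nonparametric version removes.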

Now, if you want a Dirichlet process with infinitely many classes, each with its word distribution sampled from a Dirichlet distribution, you have

G ~ DP(Dirichlet(alpha), beta) ; alpha is the concentration of the word distribution, beta is the concentration of the class distribution

class_j ~ G ; the word distribution of the class of document j

words_j ~ Multinomial(class_j)
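Here is a hedged Python sketch of the DP version, again using the Chinese restaurant process: a document joins an existing class with probability proportional to that class's size, or opens a new class with probability proportional to beta. The names dirichlet and sample_dp_classes, the symmetric Dirichlet base, and the fixed document length are assumptions for illustration only.

```python
import random

def dirichlet(rng, conc, dim):
    """Sample a symmetric Dirichlet(conc) vector of length dim via normalized gammas."""
    g = [rng.gammavariate(conc, 1.0) for _ in range(dim)]
    total = sum(g)
    return [x / total for x in g]

def sample_dp_classes(n_docs, vocab_size, doc_len, alpha, beta, seed=0):
    """Generate documents whose classes are created on demand by a CRP."""
    rng = random.Random(seed)
    class_params = []  # word distribution of each class created so far
    counts = []        # number of documents assigned to each class
    docs = []
    for _ in range(n_docs):
        # Existing class c has weight counts[c]; a brand-new class has weight beta.
        weights = counts + [beta]
        c = rng.choices(range(len(weights)), weights=weights)[0]
        if c == len(class_params):
            # New class: draw its word distribution from the base Dirichlet(alpha).
            class_params.append(dirichlet(rng, alpha, vocab_size))
            counts.append(0)
        counts[c] += 1
        words = rng.choices(range(vocab_size), weights=class_params[c], k=doc_len)
        docs.append((c, words))
    return docs, class_params

docs, class_params = sample_dp_classes(n_docs=30, vocab_size=5, doc_len=8,
                                       alpha=1.0, beta=1.0)
print(len(class_params), "classes used by", len(docs), "documents")
```

The number of classes is never specified; it grows (slowly) with the number of documents, and there is no separate prior over classes to sample.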

Interestingly enough, the DP model is usually simpler than the original one. You can also use the HDP model, in which the base measure of the DP is itself drawn from a DP, as in HDP-LDA (Teh talks about it in his HDP tutorial). Is this clear?

answered Jul 10 '10 at 20:59

Alexandre Passos ♦

This is one of the clearest explanations (in a very short space) that I've seen on this topic. Nice job Alexandre.

(Sep 18 '11 at 22:15) Will Darling

This paper might be useful -- "Construction of Nonparametric Bayesian Models from Parametric Bayes Equations." P Orbanz, NIPS 2009.

answered Sep 18 '11 at 20:01

Balaji Lakshminarayanan


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.