I don't understand how DP works to cluster similar members. The k-means algorithm considers the distance between a cluster and potential members. But in the DP descriptions I don't see a similar consideration. It seems that the members are assigned to clusters without regard to the member's or cluster's properties.
Here is the description of the chinese restaurant process from wikipedia:
"The restaurant in question operates a ... seating policy whereby new diners are seated either at a currently occupied table with probability proportional to the number of guests already seated there, or at an empty table with probability proportional to a constant."
For a meaningful clustering, I would expect a new diner to be placed with people she is similar to. Without consideration of the properties of a new diner and the existing clusters I don't follow how the DP creates clusters with members having any relationship that is useful.
asked Jul 30 '12 at 21:46
Like Mat mentioned, DP is only half of the process:
First of all, you have to remember that the CRP or DP is a prior over clusters, not a clustering method, to get a full comprehension on how a prior works, I'll base my explanation on this excellent paper on Gibbs Sampling:
Great Technical Paper on Gibbs Sampling (Be sure to read it)
In there, you can see that the author defines a Dirichlet distribution as a prior of all the possible classes, the Dirichlet distribution is only a way to code 2 different Multivariate distributions to cluster the data.
The real work of the clustering is when you calculate the likelihood (probability of belonging) of each data item to each class. In the standard way, you do this by evaluating the likelihood times the prior (which in this case is the Dirichlet Distribution).
Now imagine that you do not know how many clusters do you have before hand, you still have your multivariate to evaluate the likelihoods. So you use the CRP to have a prior that does not require previous knowledge on the clusters. As you go on the Gibbs Sampling, most methods automatically sample the concentration parameter (alpha) to update it as you have new data.
This way the CRP changes the numbers of clusters as needed, but you still have to evaluate the likelihood of the data being part of that particular class which you do by means of other distributions.
So at the end, using a CRP will ensure that the rich gets richer, but also that the data has to fit its particular cluster with certain probability.
I recommend you to implement the paper I just cited, it is going to give you tons of insight on how this thing is really working.
I recommend you this paper, which is the case for when you have Gaussian Distributions instead of Multivariates.
Yeah I remember having exactly the same confusion. The thing is that the CRP is only half of the picture giving you priors on the number of clusters (the tables) and the parameters of each cluster. You need to do additional work to go from this prior, with likelihood, to converge on the posterior corresponding to your data. One way of doing this iteration is with a gibbs sampler.
Check out this lecture; http://videolectures.net/icml05_jordan_dpcrp/ for an awesome walkthrough. The CRP description is at around 20min (but the whole tute is really good)
Some other descriptions I found useful include http://blog.echen.me/2012/03/20/infinite-mixture-models-with-nonparametric-bayes-and-the-dirichlet-process/ and http://www.cs.berkeley.edu/~jordan/papers/kulis-jordan-icml12.pdf