When we use a Dirichlet process (DP) for a mixture of distributions, we often end up with clusters that contain very few elements (1 or 2) out of a population of 4000.

My usual approach is to discard those clusters and assign their elements to the nearest remaining cluster.
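For concreteness, here is a minimal sketch of that reassignment heuristic. It assumes hard cluster labels are already available, that "nearest" means Euclidean distance to a cluster centroid, and that a cluster counts as small below a cutoff `min_size`; the function name and the cutoff are illustrative choices, not part of any library.

```python
import numpy as np

def merge_small_clusters(X, labels, min_size=3):
    """Reassign points from clusters smaller than min_size to the
    nearest remaining cluster (assumption: Euclidean distance to centroid)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels).copy()

    # Count how many points each cluster holds.
    ids, counts = np.unique(labels, return_counts=True)
    keep = ids[counts >= min_size]
    if keep.size == 0:
        raise ValueError("no cluster has at least min_size points")

    # Centroids of the clusters we keep, computed before any reassignment.
    centroids = np.stack([X[labels == k].mean(axis=0) for k in keep])

    # Points that currently sit in a small cluster.
    small = ~np.isin(labels, keep)
    if small.any():
        # Distance from each such point to every kept centroid.
        d = np.linalg.norm(X[small][:, None, :] - centroids[None, :, :], axis=2)
        labels[small] = keep[d.argmin(axis=1)]
    return labels
```

Called as `merge_small_clusters(X, labels, min_size=3)`, it returns a new label array and leaves the input untouched. With a fitted mixture model, assigning by posterior responsibility instead of centroid distance would be the more natural variant.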

Is there anything more formal to deal with these clusters? I'm guessing some kind of heuristics or model selection on the overall number of clusters.

Thanks

asked Nov 28 '12 at 01:37


Leon Palafox ♦

Samples from a Dirichlet process will generally look like that, with some number of very small clusters. This has to happen because of exchangeability: every cluster starts out as a singleton, so if very small clusters only occurred with very small probability, starting a new cluster would have to be very unlikely too.
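One way to see this is to simulate the cluster sizes implied by the DP's Chinese restaurant process representation and count the tiny clusters. The sketch below assumes a concentration parameter `alpha` chosen purely for illustration; the function name is hypothetical.

```python
import numpy as np

def sample_crp_sizes(n, alpha, rng):
    """Cluster sizes from one draw of a Chinese restaurant process
    with n points and concentration alpha."""
    sizes = []
    for i in range(n):
        # The (i+1)-th point starts a new cluster with prob alpha / (i + alpha).
        if rng.random() < alpha / (i + alpha):
            sizes.append(1)
        else:
            # Otherwise it joins an existing cluster, proportionally to its size.
            probs = np.array(sizes) / i
            k = rng.choice(len(sizes), p=probs)
            sizes[k] += 1
    return np.array(sizes)

def count_small(sizes, max_size=2):
    """Number of clusters with at most max_size points."""
    return int((sizes <= max_size).sum())
```

For example, `count_small(sample_crp_sizes(4000, 1.0, np.random.default_rng(0)))` gives the number of clusters of size 1 or 2 in a single draw; repeating this over several seeds shows how often such clusters appear under the prior alone.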

For practical purposes, though, yeah, getting rid of those might be worth it.

(Nov 28 '12 at 10:26) Alexandre Passos ♦


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.