Hello everyone,

I have a problem that at first glance looks simpler than average, but it might get tricky.

I need to cluster each of N sets of 1D iid variables into L clusters, so in the end I would have N×L means. The basic problem, then, is to cluster a single set of 1D variables.

I was thinking of using either the EM or the K-means implementation in scikit-learn (since speed is an issue, I trust their implementation rather than mine).

But perhaps I'm missing something and it is easier than I think. Perhaps there are simpler methods for doing inference over 1D data with a mixture of Gaussians.

Does anyone know of such a method?

Regards

Leon

asked Nov 24 '11 at 23:39

Leon Palafox


One Answer:

Is there any significance to the fact that you have N datasets? Do you want to cluster them in a way where the solutions for each dataset are statistically dependent (tied together)? In that case, the hierarchical Dirichlet process would be an excellent choice, although it doesn't enforce that there are L clusters (it infers the number of clusters according to the Chinese restaurant process). The biggest drawback of HDP (and other Bayesian methods) would be the significantly longer runtime, which you mentioned was a concern.

If you want to handle each of the N datasets separately, then k-means is the obvious choice for a really fast method. Note, however, that k-means assumes that the covariances of each cluster are spherical and equal to each other. In the case of 1D data, the spherical covariance assumption isn't a problem (there is no covariance for 1D data), but the assumption of equal variances for each cluster could potentially be very incorrect. If you want to guard against that, you can use an EM algorithm where the variances are estimated alongside the means. EM is also pretty fast (although not nearly as fast as k-means).
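To make the difference concrete, here is a minimal sketch of both options on one synthetic 1D dataset. It uses only the Python standard library so it runs anywhere; the function names `kmeans_1d` and `em_gmm_1d`, the quantile-based initialization, the iteration counts, and the synthetic data are all assumptions made for illustration, not anything from the question. For real workloads, scikit-learn's `KMeans` and `GaussianMixture` would be the faster route.

```python
import math
import random


def kmeans_1d(xs, k, iters=50):
    """Lloyd's algorithm on scalars; returns the k cluster means, sorted."""
    xs = sorted(xs)
    # Assumed initialization: spread the starting means over data quantiles.
    means = [xs[int((i + 0.5) * len(xs) / k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda c: abs(x - means[c]))
            groups[nearest].append(x)
        # Keep the old mean if a cluster ends up empty.
        means = [sum(g) / len(g) if g else m for g, m in zip(groups, means)]
    return sorted(means)


def em_gmm_1d(xs, k, iters=200):
    """EM for a 1D Gaussian mixture with a separate variance per cluster.

    Returns a list of (weight, mean, variance) tuples sorted by mean.
    """
    n = len(xs)
    means = kmeans_1d(xs, k)  # start from the k-means solution
    grand_mean = sum(xs) / n
    grand_var = sum((x - grand_mean) ** 2 for x in xs) / n
    variances = [grand_var] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each cluster for each point.
        resp = []
        for x in xs:
            dens = [
                w * math.exp(-((x - m) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)
                for w, m, v in zip(weights, means, variances)
            ]
            total = sum(dens) or 1e-300  # guard against underflow to zero
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean, and variance of each cluster.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var_j, 1e-6)  # floor to avoid collapse
            weights[j] = nj / n
    return sorted(zip(weights, means, variances), key=lambda t: t[1])


# Synthetic data: a tight cluster near 0 and a much wider one near 5.
random.seed(0)
data = [random.gauss(0.0, 0.5) for _ in range(300)]
data += [random.gauss(5.0, 2.0) for _ in range(300)]

km_means = kmeans_1d(data, 2)
em_params = em_gmm_1d(data, 2)
print("k-means means:", km_means)
print("EM (weight, mean, variance):", em_params)
```

With N datasets handled independently, either function would simply be called once per dataset, giving N lists of L means. The EM version also returns a variance per cluster, which is the part k-means cannot give you.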

answered Nov 26 '11 at 00:26

Kevin Canini

Of course, if he wants a fixed L he can just use a hierarchical Dirichlet prior rather than process.

(Nov 26 '11 at 07:24) Alexandre Passos ♦

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.