I am using k-means clustering on high-dimensional data. As I increase the number of clusters, the labels for the extra clusters do not align with the general pattern in the data. I have seen methods that put priors on the number of clusters and take a probabilistic approach, but I wonder whether there is a more basic diagnostic available. I am clustering unsupervised, but I know that there are differences in the dataset and roughly where they should be. There should be a pattern of progression between these categories: most of the first group should carry a certain label x, the next set of data points another label y, and so on. So if there should only be 2 or 3 categories and I ask for 5, and I see that a new label is rarely allocated, can I measure in some way how redundant that label is?
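A basic diagnostic along the lines you describe is to look at how many points each label actually receives, together with a cluster-quality score such as the silhouette, as k grows. This is only an illustrative sketch using scikit-learn on made-up data (three well-separated synthetic blobs), not something from the thread:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Three well-separated synthetic blobs in 10 dimensions, so any k > 3
# should produce redundant labels.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 10)) for c in (0.0, 5.0, 10.0)])

scores = {}
for k in (2, 3, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Very small clusters and a dropping silhouette both hint that the
    # extra labels are redundant.
    scores[k] = silhouette_score(X, labels)
    print(k, scores[k], np.bincount(labels))
```

With data like this the silhouette peaks at the true k = 3 and falls when redundant labels are added at k = 5.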
The following references should help:
FYI, I have just implemented the Adjusted Rand Index in scikit-learn and plan to implement the Consensus Index soon. As for the Bayesian approach, you can have a look at Gaussian Mixture Models with a Dirichlet Process prior (as implemented by Alexandre Passos in scikit-learn). Looking forward to comparing the two approaches.
ogrisel
@ogrisel: I use MATLAB and don't know Python yet, although I think I should learn it soon. The papers seem very interesting; I will read them. Thanks.
(Sep 13 '11 at 12:24)
VassMan
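For reference, the Adjusted Rand Index mentioned above is exposed in scikit-learn as `sklearn.metrics.adjusted_rand_score`; it compares two labelings and is invariant to permutations of the label names. A minimal sketch with invented labels:

```python
from sklearn.metrics import adjusted_rand_score

true_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
good = [1, 1, 1, 2, 2, 2, 0, 0, 0]   # same partition, labels merely renamed
bad = [0, 1, 2, 0, 1, 2, 0, 1, 2]    # partition unrelated to the true one

# Identical partitions score exactly 1.0 regardless of label names;
# unrelated partitions score near (or below) 0.
print(adjusted_rand_score(true_labels, good))
print(adjusted_rand_score(true_labels, bad))
```

Comparing a clustering against known ground truth, or two clusterings at different k against each other, are both common uses.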
See the X-means: Extending K-means with Efficient Estimation of the Number of Clusters (2000) article; it presents a method that starts from a small k and repeatedly splits centroids, using a BIC-based criterion to decide how many clusters the data actually support.
(Sep 14 '11 at 05:11)
Robert Layton
Are any of these methods implemented in matlab? They look great but I would not want to implement them myself.
(Sep 14 '11 at 09:01)
VassMan
http://www.cs.cmu.edu/~dpelleg/kmeans.html seems to contain a full x-means implementation.
(Sep 14 '11 at 09:25)
Lucian Sasu
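For a quick feel of the BIC-style model selection that X-means is built on, here is a rough analogue (not the X-means algorithm itself) using scikit-learn's `GaussianMixture.bic` on synthetic data with three true clusters:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Synthetic data: three well-separated blobs in 4 dimensions.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(60, 4)) for c in (0.0, 4.0, 8.0)])

# Fit a mixture for each candidate k and keep the BIC (lower is better).
bics = {k: GaussianMixture(n_components=k, n_init=3, random_state=0).fit(X).bic(X)
        for k in range(1, 7)}
best_k = min(bics, key=bics.get)
print(best_k, bics)
```

On data this cleanly separated the BIC minimum lands at the true k = 3; X-means makes the same kind of decision locally, when choosing whether to split a centroid.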
If you know the categories of even a handful of data points, a semi-supervised approach may do better than a purely unsupervised one. Generally, you only go unsupervised if you absolutely have to. If you can't, or don't want to, do that, you can try a hierarchical clustering approach, which lets you choose the number of clusters after organising the data into a tree for you.
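The hierarchical route can be sketched with scipy: build the merge tree once, then cut it at different depths to compare candidate cluster counts (the data, linkage method, and k values below are all illustrative choices):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(2)
# Three synthetic blobs of 30 points each in 5 dimensions.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 5)) for c in (0.0, 3.0, 6.0)])

Z = linkage(X, method="ward")  # build the full merge tree once
sizes = {}
for k in (2, 3, 5):
    # Cut the same tree to yield exactly k clusters; labels start at 1.
    labels = fcluster(Z, t=k, criterion="maxclust")
    sizes[k] = np.bincount(labels)[1:]
    print(k, sizes[k])
```

Because the tree is built only once, you can inspect several values of k cheaply, which suits the "should it be 2, 3, or 5?" question.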
There is a really naive approach: take the ratio between your total amount of data and the amount of data each cluster should contain, to get an estimate of how many clusters you will have. This approach, however, has many drawbacks, and I only use it to get a rough first guess.
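The ratio estimate above is a one-liner; the numbers here are invented purely for illustration:

```python
# Naive first guess at k: total points divided by the expected
# points per cluster (both figures are made-up examples).
n_total = 1500
expected_per_cluster = 500  # from domain knowledge of typical group size
k_guess = round(n_total / expected_per_cluster)
print(k_guess)
```

It assumes roughly equal-sized clusters, which is exactly the kind of drawback the comment warns about.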