I am using k-means clustering on high-dimensional data. As I increase the number of clusters, the extra cluster labels do not align with the general pattern of the data. I have seen methods that put priors on the number of clusters and take a probabilistic approach, but I wonder whether there is a more basic examination technique available.

I am clustering unsupervised, but I know that there are differences in the dataset and roughly where they should be. There should be a pattern of progression between these categories: in the first group most points should carry a certain label x, the next stretch of data points another label y, and so on. So if there should be only 2 or 3 categories and I ask for 5, and I see that a new label is rarely allocated, can I measure in some way how redundant that extra cluster is?
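For concreteness, here is the kind of check I have in mind, sketched in scikit-learn-style Python purely as an illustration (I actually work in MATLAB, and the data matrix below is a placeholder):

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.rand(500, 50)  # placeholder for the real high-dimensional data

    # Deliberately ask for more clusters than I believe exist.
    km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

    # Fraction of points assigned to each cluster; a cluster that
    # captures almost no points is a candidate for being redundant.
    sizes = np.bincount(km.labels_, minlength=5)
    print(sizes / sizes.sum())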

asked Sep 13 '11 at 09:27

VassMan


There is a really naive approach: take the ratio between the total amount of data and the amount of data each cluster should contain, which gives an estimate of how many clusters you will have. This approach, however, has many drawbacks, and I only use it to get a first guess.
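In code, that first guess is just a ratio (the numbers below are hypothetical):

    # Naive first guess: total number of points divided by the
    # number of points a typical cluster is expected to hold.
    n_samples = 10000             # hypothetical dataset size
    expected_cluster_size = 2000  # hypothetical prior knowledge
    k_guess = round(n_samples / expected_cluster_size)  # -> 5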

(Sep 13 '11 at 19:26) Leon Palafox ♦

3 Answers:

References on cluster-validity measures such as the Adjusted Rand Index and the Consensus Index should help here.

FYI, I have just implemented Adjusted Rand Index in scikit-learn and plan to implement the Consensus Index soon.
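A minimal sketch of how that metric is used (adjusted_rand_score is the name in current scikit-learn; the labels below are toy values):

    from sklearn.metrics import adjusted_rand_score

    # Compare two labelings of the same points: 1.0 means identical
    # partitions, while random labelings score around 0.0 on average.
    labels_true = [0, 0, 0, 1, 1, 1]  # known/reference categories
    labels_pred = [0, 0, 1, 1, 2, 2]  # clustering output
    print(adjusted_rand_score(labels_true, labels_pred))

The same score can also compare two clusterings of the same data (e.g. runs with different k or on different subsamples), which is one way to probe whether an extra cluster is stable or redundant.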

As for the Bayesian approach, you can have a look at Gaussian Mixture Models with a Dirichlet Process prior (implemented by Alexandre Passos in scikit-learn).
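In current scikit-learn this functionality lives in BayesianGaussianMixture (the original DPGMM class has since been replaced); a hedged sketch with placeholder data:

    import numpy as np
    from sklearn.mixture import BayesianGaussianMixture

    X = np.random.rand(500, 10)  # placeholder data

    # Dirichlet-process prior over the mixture weights: ask for a
    # generous upper bound on components and let the prior switch
    # the superfluous ones off.
    dpgmm = BayesianGaussianMixture(
        n_components=10,
        weight_concentration_prior_type='dirichlet_process',
        random_state=0,
    ).fit(X)

    # Components with near-zero weight are effectively unused clusters.
    print(np.round(dpgmm.weights_, 3))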

Looking forward to comparing the two approaches.

answered Sep 13 '11 at 10:12

ogrisel

@ogrisel: I use MATLAB and don't know Python yet, although I think I should learn it soon. The papers seem very interesting; I will read them. Thanks.

(Sep 13 '11 at 12:24) VassMan

See the article "X-means: Extending K-means with Efficient Estimation of the Number of Clusters" (Pelleg and Moore, 2000); it presents:

a new algorithm that efficiently searches the space of cluster locations and number of clusters to optimize the Bayesian Information Criterion (BIC)...
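X-means itself is not in scikit-learn, but the underlying idea of scoring candidate cluster counts by BIC is easy to approximate, for example with a Gaussian-mixture sweep (a rough stand-in for illustration, not the X-means splitting procedure, and the data is a placeholder):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    X = np.random.rand(500, 10)  # placeholder data

    # Fit a mixture for each candidate k and keep the one with the
    # lowest BIC; superfluous clusters are penalised by the criterion.
    bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
            for k in range(1, 8)}
    best_k = min(bics, key=bics.get)
    print(best_k, bics[best_k])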

answered Sep 13 '11 at 17:03

Lucian Sasu

Other approaches include G-means and PG-means.

(Sep 14 '11 at 05:11) Robert Layton

Are any of these methods implemented in MATLAB? They look great, but I would rather not implement them myself.

(Sep 14 '11 at 09:01) VassMan

http://www.cs.cmu.edu/~dpelleg/kmeans.html seems to contain a full X-means implementation.

(Sep 14 '11 at 09:25) Lucian Sasu

If you know the categories of even a handful of data points, a semi-supervised approach may do better than a purely unsupervised one. Generally, you only go unsupervised if you absolutely have to.

If you can't or don't want to do that, you can try a hierarchical clustering approach, which organizes the data into a tree and lets you choose the number of clusters afterwards by cutting the tree at the desired level; see the sketch below.
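For example, with SciPy (MATLAB's Statistics Toolbox has equivalent linkage and cluster functions; the data below is a placeholder):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.random.rand(200, 50)  # placeholder data

    # Build the full merge tree once, then cut it at any number of
    # clusters afterwards; no need to re-run for each candidate k.
    Z = linkage(X, method='ward')
    for k in (2, 3, 5):
        labels = fcluster(Z, t=k, criterion='maxclust')
        print(k, np.bincount(labels)[1:])  # cluster sizes at this cut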

answered Sep 14 '11 at 05:15

Robert Layton
