I have a dataset with several columns, which I reduced to two components using PCA. I then applied the k-means algorithm to the reduced data. How can I verify that my data clustered well into each group, or how do I measure misclassification? For instance, in R, if I compare the cluster vector, say k$cluster, against the labels the data had before clustering, can I just build a confusion matrix from the two and assume that 1 in the cluster vector is equivalent to 1 in my labels?

col3    col2    col1    labels
123     2.32    2.50    0
124     2.81    3.10    1
125     2.72    3.09    2
126     2.92    3.03    3
127     2.32    2.95    4

Please note that this is hypothetical data; my real data is much bigger than this.

asked Sep 14 '11 at 13:12

Akinleye Adedamola

One Answer:

I had a similar problem when trying to validate my clustering results. I used a modified version of the F1 score, where I looked at every pair of data samples. If a pair had the same label, it was a 'true' pair. If a pair was assigned to the same cluster, it was a 'positive' pair. With these definitions I calculated the TP, TN, FP, and FN counts (for example, a pair with the same label and the same cluster is a true positive), then recall and precision, and finally F1. This provides a quantitative way to compare two clusterings. Not the only one, but a good one, I think.
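The pair-counting score described above might look roughly like the following. This is only a sketch in plain Python, not the exact code I used; the function name `pairwise_f1` and the toy vectors are made up for illustration, and `clusters` stands for something like k$cluster exported from R.

```python
from itertools import combinations


def pairwise_f1(labels, clusters):
    """Pair-counting precision, recall and F1 between true labels and cluster IDs."""
    if len(labels) != len(clusters):
        raise ValueError("labels and clusters must have the same length")
    tp = fp = fn = tn = 0
    # Visit every unordered pair of samples exactly once.
    for i, j in combinations(range(len(labels)), 2):
        same_label = labels[i] == labels[j]        # a 'true' pair
        same_cluster = clusters[i] == clusters[j]  # a 'positive' pair
        if same_cluster and same_label:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_label:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}


# Hypothetical toy data: six samples, two labels, two clusters.
labels = [0, 0, 0, 1, 1, 1]
clusters = [2, 2, 1, 1, 1, 1]
scores = pairwise_f1(labels, clusters)
print(scores)
```

Because only "same cluster or not" is compared, the score does not change if the cluster IDs are renamed, so there is no need to assume that cluster 1 corresponds to label 1. The loop is quadratic in the number of samples, so for large data you would want to compute the counts from a contingency table instead.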

If you want to find which clusters go with which labels, you need a metric to compare a cluster with a label. I would suggest the Jaccard index, but there are alternatives. Once you've calculated the Jaccard index for each cluster-label pair, you 'simply' have an assignment problem. The Hungarian algorithm (or any solver for the transportation problem) can solve it, and there's code out there that will do it for you. I used an implementation of the Earth Mover's Distance by Rubner and it worked very nicely when dealing with 200 labels and clusters. Hope that helps!
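To make the matching step concrete, here is a rough sketch in plain Python. It is not the Rubner code: as a stand-in it tries every possible assignment by brute force, which is only workable for a handful of clusters. The names `jaccard_matrix` and `best_assignment` and the toy vectors are hypothetical, and the sketch assumes there are no more clusters than labels.

```python
from itertools import permutations


def jaccard_matrix(labels, clusters):
    """Jaccard index for every (cluster, label) pair; rows are clusters."""
    label_ids = sorted(set(labels))
    cluster_ids = sorted(set(clusters))
    label_sets = {l: {i for i, x in enumerate(labels) if x == l}
                  for l in label_ids}
    cluster_sets = {c: {i for i, x in enumerate(clusters) if x == c}
                    for c in cluster_ids}
    matrix = [[len(cluster_sets[c] & label_sets[l]) /
               len(cluster_sets[c] | label_sets[l])
               for l in label_ids]
              for c in cluster_ids]
    return cluster_ids, label_ids, matrix


def best_assignment(labels, clusters):
    """Brute-force the cluster-to-label mapping with the largest total Jaccard."""
    cluster_ids, label_ids, matrix = jaccard_matrix(labels, clusters)
    if len(cluster_ids) > len(label_ids):
        raise ValueError("this sketch assumes no more clusters than labels")
    best_score, best_perm = -1.0, None
    # Each permutation picks a distinct label column for every cluster row.
    for perm in permutations(range(len(label_ids)), len(cluster_ids)):
        score = sum(matrix[row][col] for row, col in enumerate(perm))
        if score > best_score:
            best_score, best_perm = score, perm
    mapping = {cluster_ids[row]: label_ids[col]
               for row, col in enumerate(best_perm)}
    return mapping, best_score


# Hypothetical toy data: cluster IDs are arbitrary and do not line up with labels.
labels = [0, 0, 0, 1, 1, 1, 2, 2]
clusters = [2, 2, 2, 3, 3, 1, 1, 1]
mapping, score = best_assignment(labels, clusters)
relabeled = [mapping[c] for c in clusters]  # cluster IDs translated to labels
print(mapping, relabeled)
```

Once the cluster IDs are translated through `mapping`, a confusion matrix of `relabeled` against `labels` becomes meaningful. For anything beyond a few clusters, a polynomial-time solver is needed; if SciPy is available, `scipy.optimize.linear_sum_assignment(matrix, maximize=True)` should be a drop-in replacement for the brute-force loop.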

answered Sep 16 '11 at 17:36

Jonathan Purnell

Unfortunately, this only works if you have the true labels. If you're doing it for exploratory purposes, the tests are more case-by-case.

(Sep 16 '11 at 22:53) Jacob Jensen

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.