|
I have a data set with several columns, which I reduced to two principal components using PCA. I then applied the k-means algorithm to the reduced data. How can I verify that my data clustered well into each group, or how do I measure the misclassification? For instance, in R, if I compare the cluster vector, say k$cluster, against the labels I had for the data before clustering, can I just build a confusion matrix from that and assume that 1 in the cluster vector is equivalent to 1 in my labels?
Please note this is hypothetical data; my real data is much bigger than this. |
|
I had a similar problem when trying to validate my clustering results. I used a modified version of the F1 score, where I looked at every pair of data samples. If a pair had the same label, it was a 'true' pair. If a pair was assigned to the same cluster, it was a 'positive' pair. With these definitions I counted the TP, TN, FP and FN pairs, then calculated precision and recall, and finally F1. This provides a quantitative way to compare two clusterings, and it does not depend on how the clusters happen to be numbered. Not the only one, but a good one, I think.

If you want to find which clusters go with which labels, you need a metric to compare a cluster with a label. I would suggest the Jaccard index (the size of the overlap divided by the size of the union), but there are alternatives. Once you've calculated the Jaccard index for each cluster-label pair, you 'simply' have an assignment problem. It can be solved efficiently (the Hungarian algorithm is the standard method), and there's code out there that will do it for you. I used an implementation of Earth Mover's Distance by Rubner and it worked very nicely when dealing with 200 labels and clusters. Hope that helps!

Unfortunately, this only works if you have the true labels. If you're clustering for exploratory purposes, the tests are more case-by-case.
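
To make the two steps concrete, here is a minimal sketch in Python (standard library only), not the code I actually used. The function names and the toy data are made up for illustration. The matching step brute-forces every possible assignment, which is only workable for a handful of clusters, and it assumes there are no more clusters than labels; a real assignment solver would take its place for larger problems.

```python
from collections import Counter
from itertools import combinations, permutations


def pairwise_f1(labels, clusters):
    """Pair-counting precision, recall and F1 between labels and clusters."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels)), 2):
        same_label = labels[i] == labels[j]        # a 'true' pair
        same_cluster = clusters[i] == clusters[j]  # a 'positive' pair
        if same_cluster and same_label:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_label:
            fn += 1
        # pairs split in both labels and clusters are TN; F1 does not need them
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def jaccard_table(labels, clusters):
    """Jaccard index (overlap / union) for every (cluster, label) pair."""
    overlap = Counter(zip(clusters, labels))
    cluster_size = Counter(clusters)
    label_size = Counter(labels)
    return {
        (c, lab): overlap[(c, lab)]
        / (cluster_size[c] + label_size[lab] - overlap[(c, lab)])
        for c in cluster_size
        for lab in label_size
    }


def match_clusters_to_labels(labels, clusters):
    """Pick the cluster-to-label assignment with the highest total Jaccard.

    Brute force over permutations: an assumption made for brevity, and it
    requires that there are no more clusters than labels.
    """
    table = jaccard_table(labels, clusters)
    cluster_ids = sorted(set(clusters))
    label_ids = sorted(set(labels))
    best = max(
        permutations(label_ids, len(cluster_ids)),
        key=lambda perm: sum(table[(c, lab)] for c, lab in zip(cluster_ids, perm)),
    )
    return dict(zip(cluster_ids, best))


# Hypothetical toy data: cluster ids are deliberately swapped relative to labels.
labels = [1, 1, 1, 2, 2, 2]
clusters = [2, 2, 2, 1, 1, 2]

precision, recall, f1 = pairwise_f1(labels, clusters)
mapping = match_clusters_to_labels(labels, clusters)
print(precision, recall, f1)
print(mapping)
```

In this toy example cluster 2 is matched to label 1 and cluster 1 to label 2, which is why you can't assume that 1 in the cluster vector means 1 in your labels: relabel the clusters with the mapping first, then build the confusion matrix. The pairwise F1 needs no relabelling at all.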
(Sep 16 '11 at 22:53)
Jacob Jensen
|