|
I have a three types of experiments with many measurements taken. I want to show that using clustering on the unlabelled measurements, I can uncover the three groups. Now using kmeans plus, I get around 60% accuracy. Which if it was binary, would be poor, but how do I quantify how good that is for three categories? Is there a p value I can compute and compare against? I am thinking that if there were 100 categories 60% for the correct labelling is great and far from random, so how can I do this with a test? |
|
If you have supervised-signal (which seems to be the case) Robert suggestion for V-measure is indeed probably good start as it is labeling-permutation independent. However V-measure is not adjusted for chance: totally random labeling will have non-null V-measure especially when k is big. You can try AMI which is adjusted for chance instead. See Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance by Vinh et al. for a detailed analysis. To get you started, this blog post provides python implementations for most of the aforementioned methods and others. Disclaimer: I haven't personally checked the correctness of those implementations. If you don't have supervised signal, you can try to evaluate the quality of your clustering by quantifying the stability of your algorithms + hypeparameters (e.g. k in k-means) on random sub sample of your data. See Clustering Stability: An Overview by Ulrike von Luxburg. @orgisel Is AMI in scikits.learn?
(Aug 30 '11 at 20:18)
Robert Layton
Not yet but would be worth adding it in my opinion. Pull requests welcomed as usual.
(Aug 30 '11 at 23:02)
ogrisel
NP. I should have some time next week (progress on my current PR is stalled as well for time), I'll put it on my list.
(Aug 30 '11 at 23:13)
Robert Layton
Great, looking forward to it.
(Aug 30 '11 at 23:28)
ogrisel
2
Thanks for the pointer. The behavior of V-measure with high values of k is a serious limitation. I'm going to check out AMI and chance normalization of conditional entropy as a direction for improvement.
(Sep 06 '11 at 21:47)
Andrew Rosenberg
That would be great indeed. I wanted to come up with a closed formula for an adjusted for chance variant of completeness, homogeneity and V-Measure usable as a stability measure for model selection by haven't found the time to do yet. Please keep me posted of your developments.
(Sep 07 '11 at 04:11)
ogrisel
If AMI fixes the problems of V-measure at high model order, it would be fantastic.
(Sep 07 '11 at 15:26)
Gael Varoquaux
showing 5 of 7
show all
|
|
It almost sounds like you have labels at test time but not at training time. If that is the case, what you're trying to do is very similar to unsupervised multi-class classification, where the goal is to classify each test example into one of K categories. In multi-class classification, one simple baseline is the accuracy of classifying every example into the most popular class. For example, if your dataset is split 40% class A, 30% class B, and 30% class C, then the naive baseline is 40%, since you can get that accuracy just by classifying everything as the most popular class. If your method gets 60% accuracy, then that's significantly better than the baseline. Baseline is a good measure. However you have to be careful using "significantly" when all you have is a higher number - there needs to be some distribution known about the results.
(Aug 29 '11 at 18:38)
Robert Layton
@Robert Layton, and @Yisong Yue. I am not sure what baseline really is and how to quantify it. Is 60% really good? What if there is no apriori knowledge about the distribution?
(Aug 30 '11 at 11:32)
VassMan
The baseline is really easy to calculate - as @Yisong Yue said, you take the percentage of instances that form the largest class. In the example, the largest class had 40% of instances, so that is the baseline. The theoretical circumstance is a classifier that just does the above, and returns "Class 1" for every prediction, regardless of the inputs. Such a classifier will get 40% accuracy.
(Aug 30 '11 at 20:15)
Robert Layton
@Robert The other degenerate solution is actually worse. If you were to construct N clusters where N is the number of data points, the "accuracy" is 100%. The point is that accuracy is an undefined term for clustering, and most evaluation measures that require a class assignment for each cluster suffer from the "problem of matching" as defined by Meila.
(Sep 06 '11 at 21:58)
Andrew Rosenberg
|
|
There are three approaches you could take. I'm sure there are more, but these are what I think are the main ones:
In my thesis I went with 1 and 2, to compare how good my clusterings of documents by authorship were. I'd also like to point out, when it comes to clustering "accuracy" doesn't really make sense - how do you know your mapping of clusters to classes is the 'right' one. Try something like the V-measure, which was developed by Andrew Rosenberg, who uses this very MetaOptimize QA. |