I have a three types of experiments with many measurements taken. I want to show that using clustering on the unlabelled measurements, I can uncover the three groups. Now using kmeans plus, I get around 60% accuracy. Which if it was binary, would be poor, but how do I quantify how good that is for three categories? Is there a p value I can compute and compare against? I am thinking that if there were 100 categories 60% for the correct labelling is great and far from random, so how can I do this with a test?

asked Aug 25 '11 at 20:12

VassMan's gravatar image

VassMan
179121517


3 Answers:

If you have supervised-signal (which seems to be the case) Robert suggestion for V-measure is indeed probably good start as it is labeling-permutation independent. However V-measure is not adjusted for chance: totally random labeling will have non-null V-measure especially when k is big. You can try AMI which is adjusted for chance instead. See Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance by Vinh et al. for a detailed analysis.

To get you started, this blog post provides python implementations for most of the aforementioned methods and others. Disclaimer: I haven't personally checked the correctness of those implementations.

If you don't have supervised signal, you can try to evaluate the quality of your clustering by quantifying the stability of your algorithms + hypeparameters (e.g. k in k-means) on random sub sample of your data. See Clustering Stability: An Overview by Ulrike von Luxburg.

answered Aug 26 '11 at 13:15

ogrisel's gravatar image

ogrisel
398464480

@orgisel Is AMI in scikits.learn?

(Aug 30 '11 at 20:18) Robert Layton

Not yet but would be worth adding it in my opinion. Pull requests welcomed as usual.

(Aug 30 '11 at 23:02) ogrisel

NP. I should have some time next week (progress on my current PR is stalled as well for time), I'll put it on my list.

(Aug 30 '11 at 23:13) Robert Layton

Great, looking forward to it.

(Aug 30 '11 at 23:28) ogrisel
2

Thanks for the pointer. The behavior of V-measure with high values of k is a serious limitation. I'm going to check out AMI and chance normalization of conditional entropy as a direction for improvement.

(Sep 06 '11 at 21:47) Andrew Rosenberg

That would be great indeed. I wanted to come up with a closed formula for an adjusted for chance variant of completeness, homogeneity and V-Measure usable as a stability measure for model selection by haven't found the time to do yet. Please keep me posted of your developments.

(Sep 07 '11 at 04:11) ogrisel

If AMI fixes the problems of V-measure at high model order, it would be fantastic.

(Sep 07 '11 at 15:26) Gael Varoquaux
showing 5 of 7 show all

It almost sounds like you have labels at test time but not at training time. If that is the case, what you're trying to do is very similar to unsupervised multi-class classification, where the goal is to classify each test example into one of K categories.

In multi-class classification, one simple baseline is the accuracy of classifying every example into the most popular class. For example, if your dataset is split 40% class A, 30% class B, and 30% class C, then the naive baseline is 40%, since you can get that accuracy just by classifying everything as the most popular class. If your method gets 60% accuracy, then that's significantly better than the baseline.

answered Aug 29 '11 at 15:01

Yisong%20Yue's gravatar image

Yisong Yue
3311414

Baseline is a good measure. However you have to be careful using "significantly" when all you have is a higher number - there needs to be some distribution known about the results.

(Aug 29 '11 at 18:38) Robert Layton

@Robert Layton, and @Yisong Yue. I am not sure what baseline really is and how to quantify it. Is 60% really good? What if there is no apriori knowledge about the distribution?

(Aug 30 '11 at 11:32) VassMan

The baseline is really easy to calculate - as @Yisong Yue said, you take the percentage of instances that form the largest class. In the example, the largest class had 40% of instances, so that is the baseline.

The theoretical circumstance is a classifier that just does the above, and returns "Class 1" for every prediction, regardless of the inputs. Such a classifier will get 40% accuracy.

(Aug 30 '11 at 20:15) Robert Layton

@Robert The other degenerate solution is actually worse. If you were to construct N clusters where N is the number of data points, the "accuracy" is 100%.

The point is that accuracy is an undefined term for clustering, and most evaluation measures that require a class assignment for each cluster suffer from the "problem of matching" as defined by Meila.

(Sep 06 '11 at 21:58) Andrew Rosenberg

There are three approaches you could take. I'm sure there are more, but these are what I think are the main ones:

  1. Compare against state of the art. Quite simple - run other methods or algorithms on the dataset, and determine if your approach beats them. This is probably the normal approach.
  2. Run a Monte Carlo simulation to determine what the chance rate is. A bit more difficult - determine what the parameters are in the methodology and run a Monte Carlo simulation to determine the distribution of scores. You can then calculate the estimated p-value from this simulation. It is, however, often tricky to determine exactly what to randomise and what to keep the same.
  3. Determine a cost benefit to using this approach. If there is an industry application to this - i.e. using your approach to find customers more likely to buy a given product - determine what the benefit to an organisation would be, in using your product.

In my thesis I went with 1 and 2, to compare how good my clusterings of documents by authorship were.

I'd also like to point out, when it comes to clustering "accuracy" doesn't really make sense - how do you know your mapping of clusters to classes is the 'right' one. Try something like the V-measure, which was developed by Andrew Rosenberg, who uses this very MetaOptimize QA.

answered Aug 25 '11 at 22:17

Robert%20Layton's gravatar image

Robert Layton
1520102337

Your answer
toggle preview

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.