I am clustering documents by authorship, with known classes, in order to see which methods produce accurate clusters. The issue I'm having at the moment is that I know which of the methods produces the best clusterings, thanks to the v-measure, but I have no basis for determining how good that result is. With a score like accuracy, I can intuitively understand that 90% accuracy is 'high'. While the v-measure ranges from 0 to 1, I have no comparable intuition for what a given value means.

My idea to fix this is to find a distribution of scores using a Monte Carlo type simulation. That way I can say that my results are 'better than expected', giving a baseline result. Is this the best way to go about this, or are there other methods for determining such a baseline?

asked Jan 06 '11 at 20:19

Robert Layton


2 Answers:

There's no inherent baseline for V-Measure, just as there's no inherent baseline for Accuracy.

For accuracy, a trivial strategy such as always predicting the majority class gives a value equal to the rate of the majority class. For a task where the majority class makes up more than 90% of the data, 90% accuracy is not good.
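To make the point about accuracy concrete, here is a minimal sketch that computes the accuracy you get by always predicting the most frequent class. The 90/10 label split and the function name `majority_baseline_accuracy` are made up for illustration.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Hypothetical, heavily imbalanced data set: 90 documents by author "a", 10 by "b".
labels = ["a"] * 90 + ["b"] * 10
baseline = majority_baseline_accuracy(labels)
print(baseline)  # 0.9
```

On data like this, a method reporting 90% accuracy has not improved on the trivial baseline at all.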

For v-measure, random assignment of class members to clusters will give you a value that approaches zero. V-measure is an information theory based approach -- random assignment leads to no decrease in conditional entropy H(C | K), thus a zero value for homogeneity, which will drive the v-measure to zero.

The intuition behind v-measure is that homogeneity and completeness are competing desirable qualities, similar to precision and recall. The measures of homogeneity and completeness are intuitive only insofar as entropy is intuitive. Think of homogeneity as ranging from 0 (completely random assignment of data points to clusters) to 1 (perfectly homogeneous clusters). The calculation of homogeneity used by v-measure represents how close you are to a perfect clustering on this scale. Completeness works similarly, ranging from 0 (completely random assignment of data points) to 1 (perfectly complete clusters).
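As a rough illustration of how these quantities fall out of entropy, here is a small pure-Python sketch. It assumes the usual definitions, homogeneity = 1 - H(C|K)/H(C), completeness = 1 - H(K|C)/H(K), and v-measure as their harmonic mean (equal weighting). The function names and the toy label lists are mine, not from any library.

```python
import math
import random
from collections import Counter

def entropy(labels):
    """Shannon entropy H(X) of a list of labels, in nats."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def conditional_entropy(x, given):
    """H(X | Given), estimated from two paired label lists."""
    n = len(x)
    joint = Counter(zip(x, given))
    marginal = Counter(given)
    return -sum((c / n) * math.log(c / marginal[g]) for (_, g), c in joint.items())

def v_measure(classes, clusters):
    """Return (homogeneity, completeness, v-measure) with equal weighting."""
    h_c, h_k = entropy(classes), entropy(clusters)
    hom = 1.0 if h_c == 0 else 1.0 - conditional_entropy(classes, clusters) / h_c
    com = 1.0 if h_k == 0 else 1.0 - conditional_entropy(clusters, classes) / h_k
    v = 0.0 if hom + com == 0 else 2 * hom * com / (hom + com)
    return hom, com, v

classes = [0, 0, 0, 1, 1, 1]

# Perfect clustering (cluster ids are arbitrary): all three scores are 1.0.
perfect = v_measure(classes, [1, 1, 1, 0, 0, 0])
print(perfect)

# Over-split clustering: every cluster is pure (homogeneity 1),
# but each class is spread over two clusters (completeness below 1).
oversplit = v_measure(classes, [0, 0, 1, 2, 2, 3])
print(oversplit)

# Random assignment on a larger toy sample: all three scores are close to 0.
rng = random.Random(0)
big_classes = [rng.randrange(5) for _ in range(10000)]
big_clusters = [rng.randrange(5) for _ in range(10000)]
random_scores = v_measure(big_classes, big_clusters)
print(random_scores)
```

The over-split case is the precision/recall-style trade-off in miniature: splitting clusters further can only help homogeneity, and it costs completeness.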

As for the baseline calculation: a baseline using v-measure should be calculated the same way you would calculate a baseline for any other task. You can compare against an existing technique and make a point-wise comparison. Or, as you describe, you can generate a number of samples to approximate a distribution, so that you can use a test to measure the statistical significance of an improvement. I'd recommend against any parametric testing, though (e.g., a t-test); it's highly unlikely that v-measure values are normally distributed.
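One non-parametric way to build such a distribution is a permutation (Monte Carlo) baseline: shuffle the cluster assignments many times, score each shuffle, and see where the observed score falls. The sketch below is only one possible setup. The toy data (3 authors, 20 documents each, 20% of documents deliberately misassigned), the helper names, and the choice of 1000 shuffles are all assumptions for illustration.

```python
import math
import random
from collections import Counter

def _entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def _cond_entropy(x, given):
    n = len(x)
    joint, marginal = Counter(zip(x, given)), Counter(given)
    return -sum((c / n) * math.log(c / marginal[g]) for (_, g), c in joint.items())

def v_measure(classes, clusters):
    """V-measure with equal weighting of homogeneity and completeness."""
    h_c, h_k = _entropy(classes), _entropy(clusters)
    hom = 1.0 if h_c == 0 else 1.0 - _cond_entropy(classes, clusters) / h_c
    com = 1.0 if h_k == 0 else 1.0 - _cond_entropy(clusters, classes) / h_k
    return 0.0 if hom + com == 0 else 2 * hom * com / (hom + com)

def permutation_baseline(classes, clusters, n_perm=1000, seed=0):
    """Score shuffled cluster assignments to approximate a chance-level distribution.

    Shuffling keeps the cluster sizes fixed but destroys any relation to the classes.
    Returns the observed score, the list of shuffled scores, and an empirical p-value.
    """
    rng = random.Random(seed)
    observed = v_measure(classes, clusters)
    shuffled = list(clusters)
    null_scores = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        null_scores.append(v_measure(classes, shuffled))
    # The +1 terms keep the estimate from ever being exactly zero.
    p_value = (1 + sum(s >= observed for s in null_scores)) / (1 + n_perm)
    return observed, null_scores, p_value

# Toy data: 3 authors with 20 documents each; every 5th document is put in the wrong cluster.
classes = [i // 20 for i in range(60)]
clusters = [(c + 1) % 3 if i % 5 == 0 else c for i, c in enumerate(classes)]

observed, null_scores, p_value = permutation_baseline(classes, clusters)
print("observed v-measure:", round(observed, 3))
print("largest shuffled v-measure:", round(max(null_scores), 3))
print("empirical p-value:", p_value)
```

The same list of shuffled scores can be used for percentiles or a histogram, and no normality assumption is involved at any point.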

answered Jan 07 '11 at 14:17

Andrew Rosenberg

Exactly the answer I was looking for, from exactly the person I was hoping. Between yours and Alexandre's answer, I think the best option would be to compare against some different methods. I'll also go the non-parametric route for testing significance, which was the aim with the Monte Carlo methods.

Thanks

(Jan 07 '11 at 18:57) Robert Layton

As a side note, the v-measure appears to be binomial when given 1 million random clusterings (completely random cluster values). Have a look: http://imgur.com/Crhxh

(Jan 08 '11 at 23:19) Robert Layton

I think the only reasonable baseline definition is to use both a very simple method (like random assignments) to get a lower bound and an unrealistic method (perfect assignments, human assignments to a subsample of your data, etc.) to get an upper bound. Then you can see how well your methods behave with respect to these bounds.

It is also generally a good idea to have some similar prior work (either an algorithm that you can implement and test on your data or some previous paper with clear results on the same dataset) you can use as a baseline, so you can just report improvements over that. I find it untrustworthy when papers don't report any comparisons with other papers, and have to argue that the results "look good".

answered Jan 07 '11 at 12:05

Alexandre Passos ♦

Thanks for this. I agree that it doesn't look good when a paper is only comparing against itself. To that end, I'll have to test against some other methods.

(Jan 07 '11 at 18:58) Robert Layton

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.