I'm trying to use the V-measure, an external validation measure, to evaluate a clustering algorithm. After clustering on the training dataset, I use the test dataset to compute the V-measure. For each data point in the test dataset, I search for the nearest representative (the cluster with which this point should be associated), and I construct a matrix A of dimensions C x K, where C is the number of classes and K is the number of clusters (representatives). This matrix tells us how many data points of class c are associated with cluster k. Then I compute the homogeneity using the formula described in the paper. But I don't understand what the big N in the formula stands for; it's not specified in the paper. Do you have any idea?
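The matrix construction described above can be sketched as follows. This is only an illustration under assumptions that are not in the question: Euclidean distance, integer class labels starting at 0, and the hypothetical helper names `nearest_representative` and `contingency_matrix`.

```python
import math

def nearest_representative(x, representatives):
    """Index of the representative closest to x (Euclidean distance assumed)."""
    return min(range(len(representatives)),
               key=lambda k: math.dist(x, representatives[k]))

def contingency_matrix(points, classes, representatives, n_classes):
    """A[c][k] = number of test points of class c assigned to cluster k."""
    n_clusters = len(representatives)
    A = [[0] * n_clusters for _ in range(n_classes)]
    for x, c in zip(points, classes):
        A[c][nearest_representative(x, representatives)] += 1
    return A

# Toy test set: five 2-D points, two classes, two representatives.
points = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (4.9, 5.0), (0.1, 0.3)]
classes = [0, 0, 1, 1, 1]
representatives = [(0.0, 0.0), (5.0, 5.0)]

A = contingency_matrix(points, classes, representatives, n_classes=2)
print(A)  # [[2, 0], [1, 2]]
```

The entries of A sum to the number of test points, which is the quantity the thread goes on to identify with N.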

asked Jan 17 at 08:26

Shna

It looks like it's the total number of data points.

(Jan 17 at 10:09) Alexandre Passos ♦

One Answer:

As Alexandre says, I think N is the total number of data points. I also think there are notation inconsistencies in the paper. I rewrote the formulas with a consistent notation in the scikit-learn documentation.
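To make the role of N concrete, here is a minimal sketch of homogeneity computed from a C x K count matrix, taking N as the sum of all entries. The function name `homogeneity` and the list-of-lists layout are assumptions for illustration; this is not the scikit-learn implementation.

```python
import math

def homogeneity(A):
    """Homogeneity from A, where A[c][k] counts points of class c in cluster k."""
    n_classes, n_clusters = len(A), len(A[0])
    N = sum(sum(row) for row in A)  # total number of data points
    class_sizes = [sum(row) for row in A]
    cluster_sizes = [sum(A[c][k] for c in range(n_classes))
                     for k in range(n_clusters)]

    # H(C): entropy of the class distribution.
    h_c = -sum((n_c / N) * math.log(n_c / N)
               for n_c in class_sizes if n_c > 0)
    # H(C|K): entropy of the classes given the cluster assignment.
    h_c_given_k = -sum((A[c][k] / N) * math.log(A[c][k] / cluster_sizes[k])
                       for c in range(n_classes)
                       for k in range(n_clusters) if A[c][k] > 0)

    # Convention assumed here: a single class is trivially homogeneous.
    if h_c == 0:
        return 1.0
    return 1.0 - h_c_given_k / h_c

print(homogeneity([[5, 0], [0, 5]]))  # each cluster holds one class
print(homogeneity([[5], [5]]))        # one cluster mixing both classes
```

Completeness can be obtained the same way by passing the transposed matrix, since it swaps the roles of classes and clusters.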

Also note, as mentioned earlier in the same section of that documentation, that V-measure is not adjusted for chance: it will favor random clusterings with a higher number of clusters.
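The lack of chance adjustment can be checked with a small experiment. The sketch below assigns points to clusters uniformly at random and averages the V-measure over a few trials. The helper names, the sample size, and the number of trials are arbitrary choices for illustration; scikit-learn's `v_measure_score` should show the same trend.

```python
import math
import random
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((m / n) * math.log(m / n) for m in Counter(labels).values())

def conditional_entropy(a, b):
    """H(a | b) estimated from two parallel label lists."""
    n = len(a)
    joint = Counter(zip(a, b))
    b_sizes = Counter(b)
    return -sum((m / n) * math.log(m / b_sizes[y])
                for (_, y), m in joint.items())

def v_measure(classes, clusters):
    h_c, h_k = entropy(classes), entropy(clusters)
    hom = 1.0 if h_c == 0 else 1.0 - conditional_entropy(classes, clusters) / h_c
    com = 1.0 if h_k == 0 else 1.0 - conditional_entropy(clusters, classes) / h_k
    return 0.0 if hom + com == 0 else 2 * hom * com / (hom + com)

def mean_random_v(n_clusters, n_points=200, n_classes=4, trials=20, seed=0):
    """Average V-measure of purely random cluster assignments."""
    rng = random.Random(seed)
    classes = [i % n_classes for i in range(n_points)]
    scores = []
    for _ in range(trials):
        clusters = [rng.randrange(n_clusters) for _ in range(n_points)]
        scores.append(v_measure(classes, clusters))
    return sum(scores) / trials

for k in (2, 10, 50):
    print(k, round(mean_random_v(k), 3))
```

The random assignments carry no information about the classes, yet the average score grows with the number of clusters, so scores are only comparable between clusterings with a similar number of clusters.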

answered Jan 18 at 01:33

ogrisel

edited Jan 18 at 01:34

In your link, the big N used to compute H(C|K) is the same as the small n used to compute H(C). Is it really the same quantity (i.e. the total number of data points)? I think that the small n in the paper stands for the number of classes, but I'm not sure.

@AlexandrePassos

(Jan 18 at 06:22) Shna

I think the paper's notation is incorrect: the definitions of n and m are not used consistently. For my version, I believe I derived the formulas from the definition of conditional entropy on Wikipedia and applied it to the clustering setting. I might have made mistakes, but I am reasonably confident, as the measures look good in practice.
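For reference, the derivation mentioned above can be written out as follows. This is a reconstruction from the standard definition of conditional entropy, not a quote from the paper or from the scikit-learn documentation. The symbols are assumptions: a_{ck} is the number of points of class c in cluster k, n_c and n_k are the class and cluster sizes, and N is the total number of points.

```latex
% Empirical probabilities from the C x K count matrix:
%   p(c,k) = a_{ck} / N,   p(k) = n_k / N,   p(c \mid k) = a_{ck} / n_k
\begin{align}
H(C \mid K) &= -\sum_{k=1}^{K} \sum_{c=1}^{C} p(c,k) \log p(c \mid k)
             = -\sum_{k=1}^{K} \sum_{c=1}^{C} \frac{a_{ck}}{N} \log \frac{a_{ck}}{n_k} \\
H(C)        &= -\sum_{c=1}^{C} \frac{n_c}{N} \log \frac{n_c}{N} \\
h           &= 1 - \frac{H(C \mid K)}{H(C)}
\end{align}
```

Under this reading, the same N (the total number of data points) normalizes both H(C|K) and H(C), and terms with a_{ck} = 0 contribute nothing to the sum.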

(Jan 18 at 11:37) ogrisel

Yes, you're probably right; it gives more reasonable results for my evaluation. Thanks. I wonder why the paper's notation was never corrected!

(Jan 18 at 12:07) Shna

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.