|
I'm trying to use the external validation measure V-measure to evaluate a clustering algorithm.
After clustering on the training dataset, I use the test dataset to compute the V-measure. For each point in the test dataset I search for the nearest representative (the cluster to which this point should be assigned) and I construct a matrix A of dimensions C x K, where C is the number of classes and K the number of clusters (representatives). This matrix tells us how many points of class c are assigned to cluster k. Then I compute the homogeneity using the formula described in the paper. But I don't understand what the big N in the formula stands for; it's not specified in the paper. Do you have any idea?
|
|
As Alexandre says, I think that N is the total number of points. I also think there are notation inconsistencies in the paper. I rewrote the formulas with a consistent notation in the scikit-learn documentation. Also note, earlier in the same section, that V-measure is not adjusted for chance: it will favor random clusterings with a higher number of clusters.

In your link, the big N used to compute H(C|K) is the same as the small n used to compute H(C). Is it really the same (i.e. the total number of data points)? I think that the small n in the paper stands for the number of classes (but I'm not sure). @AlexandrePassos
(Jan 18 at 06:22)
Shna
I think the paper's notation is incorrect and the definitions of n and m are not respected. In my version, I think I did the maths from the definition of conditional entropy on Wikipedia and applied it to the clustering setting. I might have made mistakes, but I am reasonably confident as the measures look good in practice.
(Jan 18 at 11:37)
ogrisel
Yes, you're probably right; it gives more reasonable results for my evaluation. Thanks. I wonder why the paper's notation was never corrected!
(Jan 18 at 12:07)
Shna
|

It looks like it's the total number of data points.
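
For what it's worth, here is a minimal sketch of the homogeneity computation from the C x K matrix A described in the question, assuming N is the total number of test points (that is, the sum of all entries of A). It follows the usual conditional entropy definition rather than the paper's exact notation, and the function name `homogeneity_from_contingency` is only illustrative:

```python
import math


def homogeneity_from_contingency(A):
    """Homogeneity from a C x K contingency table A (a list of rows).

    A[c][k] is the number of test points of class c assigned to cluster k.
    Assumption: the "big N" is the total number of points, sum of all A[c][k].
    """
    n_total = sum(sum(row) for row in A)            # the "big N"
    class_sizes = [sum(row) for row in A]           # points per class
    cluster_sizes = [sum(col) for col in zip(*A)]   # points per cluster

    # H(C) = -sum_c (n_c / N) * log(n_c / N)
    h_c = -sum((n_c / n_total) * math.log(n_c / n_total)
               for n_c in class_sizes if n_c > 0)
    if h_c == 0.0:
        # Only one class present: homogeneity is taken to be 1 by convention.
        return 1.0

    # H(C|K) = -sum_{c,k} (a_ck / N) * log(a_ck / n_k), skipping empty cells
    h_c_given_k = -sum((a / n_total) * math.log(a / cluster_sizes[k])
                       for row in A
                       for k, a in enumerate(row) if a > 0)

    return 1.0 - h_c_given_k / h_c


# Each cluster contains a single class, so homogeneity is 1.
print(homogeneity_from_contingency([[5, 0], [0, 5]]))
# Classes are spread evenly over the clusters, so homogeneity is 0.
print(homogeneity_from_contingency([[3, 3], [3, 3]]))
```

If you have the per-point class labels and cluster assignments rather than the matrix, `sklearn.metrics.homogeneity_score(labels_true, labels_pred)` should give the same quantity and is a handy cross-check.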