I'm writing code that compares clustering assignments to a set of gold-standard labels using Adjusted Mutual Information. In some of our datasets there is only one gold-standard label (so all data points receive that label), and several clustering solutions similarly create just one cluster. From a scoring perspective, these one-cluster solutions should receive a perfect score, since their assignments match the gold standard exactly, but Mutual Information and its adjusted form, Adjusted Mutual Information, automatically score these results as 0.
They both result in a score of 0 because of the core computation in Mutual Information:

MI(U, V) = Σᵢ Σⱼ P(i, j) · log( P(i, j) / (P(i) · P(j)) )

With one cluster and one label there is only a single (i, j) pair, and P(i, j) = P(i) = P(j) = 1, so the only term is 1 · log(1) = 0.
In theory, Mutual Information measures the amount of knowledge gained about one distribution given another: a high Mutual Information indicates that the two distributions are highly informative of each other, while a low Mutual Information indicates that they are nearly independent. With just one cluster and one label, the two situations become ambiguous: you have perfect information about the class labels given the clustering labels, but at the same time the two are probabilistically independent of each other. MI and AMI both default to the latter interpretation, independence.
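To make the degenerate case concrete, here is a small self-contained sketch that computes MI from the empirical joint distribution of two labelings (function and variable names are my own, not from any particular library). With a single cluster and a single label, every term collapses to 1 · log(1) = 0:

```python
from collections import Counter
from math import log

def mutual_information(labels_a, labels_b):
    """Mutual information (in nats) between two label assignments,
    computed from the empirical joint distribution."""
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))   # joint counts n(a, b)
    counts_a = Counter(labels_a)               # marginal counts n(a)
    counts_b = Counter(labels_b)               # marginal counts n(b)
    mi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        mi += p_ab * log(p_ab / ((counts_a[a] / n) * (counts_b[b] / n)))
    return mi

# One gold label, one cluster: the single term is 1 * log(1) = 0.
degenerate = mutual_information([0, 0, 0, 0], [1, 1, 1, 1])

# By contrast, a perfectly matching two-cluster solution scores log(2) nats.
matched = mutual_information([0, 0, 1, 1], [1, 1, 0, 0])
```

Note that the cluster IDs themselves don't matter (only the partition does), which is why `[0, 0, 1, 1]` vs `[1, 1, 0, 0]` still counts as a perfect match.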
So my question is: is it ridiculous to assign an MI/AMI score of 1.0 in the odd case where there's only one event for both distributions, i.e. one cluster and one gold-standard label? The situation itself is kind of silly, but that's somewhat out of my hands.
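If you do decide to special-case it, one possible shape is a thin wrapper around whatever scorer you already use (the wrapper name and the return-1.0 convention here are my own proposal, not a standard API; `score_fn` could be, e.g., `sklearn.metrics.adjusted_mutual_info_score`):

```python
def score_with_degenerate_case(labels_true, labels_pred, score_fn):
    """Treat the one-label / one-cluster case as a perfect match
    instead of deferring to the underlying MI/AMI score.

    score_fn: any (labels_true, labels_pred) -> float scorer.
    """
    if len(set(labels_true)) == 1 and len(set(labels_pred)) == 1:
        # Both sides are a single block: the partitions are identical,
        # so by this (debatable) convention we return a perfect score.
        return 1.0
    return score_fn(labels_true, labels_pred)

# The wrapper only intervenes in the degenerate case; otherwise it
# passes straight through to the supplied scorer.
special = score_with_degenerate_case([0, 0, 0], [1, 1, 1], lambda a, b: 0.0)
normal = score_with_degenerate_case([0, 1, 1], [1, 1, 0], lambda a, b: 0.25)
```

This keeps the special case explicit and documented at the call site rather than hidden inside a modified metric.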
asked Feb 06 '12 at 14:54
You said it yourself: mutual information is the amount of information gained through new knowledge. Knowing that a data point exists (which is all you really have when there is only one label) does not give you any extra information; you already knew it existed.
answered Feb 07 '12 at 00:36