(1) What existing measures allow us to evaluate the quality of an unsupervised clustering algorithm when we don't have labels associated with our data?

(2) If we now suppose that each data point has an associated label, is there an efficient method to label the representatives after training? Currently, I associate to each representative the label that occurs most often among the labels of the data assigned to that representative during training.
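Concretely, the majority-vote scheme I currently use looks roughly like this (a minimal Python sketch; the function name is my own):

```python
from collections import Counter

def label_representatives(assignments, labels):
    """Give each representative the most frequent label among its points.

    assignments: representative index for each data point
    labels:      ground-truth label for each data point
    """
    votes = {}
    for rep, lab in zip(assignments, labels):
        votes.setdefault(rep, Counter())[lab] += 1
    # most_common(1) picks the majority label for each representative
    return {rep: counter.most_common(1)[0][0] for rep, counter in votes.items()}
```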

asked Jan 13 '12 at 08:20

shn

edited Jan 13 '12 at 09:50


2 Answers:

I am not sure I understand your question, but maybe the consensus index, as introduced in the first answer to this question, might help: Examining the number of centers for kmeans

answered Jan 13 '12 at 09:23

ogrisel

edited Jan 13 '12 at 09:24

The link that you gave is more about the stability of the number of clusters (the optimal number of representatives). My first question is simple: suppose you have a training/testing dataset of unlabelled data. You apply your unsupervised clustering algorithm on the training set and obtain K representatives. To cross-validate your clustering result you would use the testing dataset on the K obtained representatives, but you cannot measure, for example, the error rate or F-measure, since the data are unlabelled. Is there an adequate measure in this case?

(Jan 13 '12 at 10:05) shn

The consensus approach from the first link is done as follows:

  • Split the data into 3 folds: A, B, C.

  • Apply the clustering algorithm twice, independently on A+B and on A+C, to get label assignments for each sample in A+B and in A+C.

  • Compute the Adjusted Rand Index of the two label assignments on the overlapping fold A.

This score is called the Consensus Index. It is bounded above by 1 (perfect agreement and stability) and has an expected value of zero for random labellings.

By applying this approach for various values of k in k-means, one can find the value of k that maximizes the Consensus Index.
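A minimal Python sketch of these steps (assuming scikit-learn, with k-means as the base clusterer; the fold split, seeds, and toy data are arbitrary choices):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def consensus_index(X, k, rng):
    """One Consensus Index estimate for k clusters (sketch of the steps above)."""
    idx = rng.permutation(len(X))      # shuffle, then split indices into 3 folds
    A, B, C = np.array_split(idx, 3)
    # Cluster A+B and A+C independently; fold A comes first in both subsets
    la = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X[np.concatenate([A, B])])
    lc = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X[np.concatenate([A, C])])
    # Agreement of the two runs on the shared fold A
    return adjusted_rand_score(la[:len(A)], lc[:len(A)])

# Toy data: two well-separated blobs, so k=2 should score near 1
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(90, 2), rng.randn(90, 2) + 6])
scores = {k: consensus_index(X, k, rng) for k in (2, 3, 4)}
```

Averaging the score over several random splits would give a more stable estimate than a single split.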

(Jan 13 '12 at 10:32) ogrisel

Ok, well, I'll read the paper about the Consensus Index later (http://ee.unsw.edu.au/~nguyenv/K_detection_15.pdf). Now, if I understand your comment correctly, the method determines the label assignments from the clustering results on an overlapping part of the dataset, and then computes the score on that part, right?

In this case, can you tell me whether it would be useful (and/or more accurate) to compute (using the method you described) the label assignments and scores on several "different" overlapping areas of the dataset? Suppose, for example, that I have a dataset of 10 data points (to simplify) and I apply the clustering each time to 5 data points (and compute the score), something like:

[1 2 3 4 5] 6 7 8 9 10 ==> initial step

1 [2 3 4 5 6] 7 8 9 10 ==> compute score for area [2 3 4 5]

1 2 [3 4 5 6 7] 8 9 10 ==> compute score for area [3 4 5 6]

1 2 3 [4 5 6 7 8] 9 10 ==> compute score for area [4 5 6 7]

1 2 3 4 [5 6 7 8 9] 10 ==> compute score for area [5 6 7 8]

1 2 3 4 5 [6 7 8 9 10] ==> compute score for area [6 7 8 9]

This may be interesting for me since, in my real application, I have a data stream where data points arrive one by one (the whole dataset is not available initially), and I don't want to keep a huge number of data points in memory.
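One possible buffered variant, sketched in Python (window and step sizes, and the `stream_consensus` name, are my own assumptions): each new window is clustered once it is full, and the adjusted Rand index is computed on the points it shares with the previous window.

```python
from collections import deque
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def stream_consensus(stream, window=200, step=100, k=2):
    """Adjusted Rand index between consecutive window clusterings,
    computed on the (window - step) points the two windows share."""
    buf = deque(maxlen=window)   # bounded buffer: old points fall out
    prev, scores = None, []
    for i, x in enumerate(stream):
        buf.append(x)
        if len(buf) == window and (i + 1) % step == 0:
            labels = KMeans(n_clusters=k, n_init=10,
                            random_state=0).fit_predict(np.array(buf))
            if prev is not None:
                # The previous window's last (window - step) points are
                # this window's first (window - step) points.
                scores.append(adjusted_rand_score(prev[step:],
                                                  labels[:window - step]))
            prev = labels
    return scores
```

The memory cost is bounded by the window size, which matches the streaming constraint; as noted below, the window should be large relative to the number of clusters.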

(Jan 13 '12 at 11:31) shn

You would probably need a much larger sliding window (I would say 100 times the number of clusters), but yes, I suppose you could adapt the Consensus Index evaluation to streaming data using this kind of buffering of the stream.

(Jan 13 '12 at 11:38) ogrisel

However, I am not aware of any reference work that has evaluated Consensus Index model selection / evaluation in a streaming setting.

(Jan 13 '12 at 11:39) ogrisel

(1) Choose a simple baseline clustering method, and use VI (variation of information) or the V-measure to compare it to your own clustering.

(2) Your method is called many-to-one mapping. There is also another method called one-to-one mapping, in which only one induced label may be assigned to each annotated label (see, for example, Johnson 2007).
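For illustration, an optimal one-to-one mapping can be computed with the Hungarian algorithm (a sketch assuming SciPy; this is my own illustration, not necessarily the exact procedure in Johnson 2007):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def one_to_one_accuracy(clusters, labels):
    """Match each cluster to a distinct gold label so that the number of
    correctly mapped points is maximised, then report that accuracy."""
    cs, ls = np.unique(clusters), np.unique(labels)
    counts = np.zeros((len(cs), len(ls)))
    for c, l in zip(clusters, labels):
        counts[np.searchsorted(cs, c), np.searchsorted(ls, l)] += 1
    rows, cols = linear_sum_assignment(-counts)  # negate to maximise overlap
    mapping = {cs[r]: ls[c] for r, c in zip(rows, cols)}
    # Clusters left unmatched (more clusters than labels) count as errors.
    return sum(mapping.get(c) == l for c, l in zip(clusters, labels)) / len(labels)
```

Under many-to-one, both clusters could claim the same label; one-to-one forces distinct labels, so it typically yields a lower, stricter score.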

answered Jan 27 '12 at 20:23

Yariv Maron

Are you sure about using VI and V? I don't know about VI, but for the V-measure we need labelled data to evaluate homogeneity and completeness. Is that also the case for VI? Recall that I have unlabelled data.

(Jan 29 '12 at 17:17) shn


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.