I'm doing clustering (say k-means) on some data for which I don't know the ground truth clustering. Are there still ways I could evaluate how good my clustering is?

asked Apr 06 '11 at 11:34

ebony

Also see this question on crossvalidated: http://stats.stackexchange.com/questions/7175/understanding-comparisons-of-clustering-results/7425#7425

(Apr 08 '11 at 02:09) Justin Bayer
Related question: How do you choose K without any ground truth? Though there might be several Ks that are plausible, not every K is plausible.

(Apr 08 '11 at 19:51) Joseph Turian ♦♦

For something like k-means, one could use the average distortion as a way to select K (look at the "elbow point" on the average distortion vs K curve). For probabilistic models (like GMM), one could use the log-likelihood.
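As a rough illustration of the elbow heuristic, here is a minimal sketch in plain Python. The data is synthetic (three made-up blobs), and `kmeans` and `average_distortion` are hypothetical helper names written for this example, not functions from any library. How clear the elbow looks depends heavily on the data.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns (centers, total squared distortion)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: group points by nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its old center).
        new_centers = [
            tuple(sum(xs) / len(g) for xs in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    distortion = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, distortion


def average_distortion(points, k, restarts=10):
    """Best distortion over several restarts, divided by the number of points."""
    best = min(kmeans(points, k, seed=s)[1] for s in range(restarts))
    return best / len(points)


# Synthetic data with three well-separated blobs (made up for illustration).
rng = random.Random(42)
blob_centers = [(0, 0), (10, 0), (5, 9)]
points = [
    (cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
    for cx, cy in blob_centers
    for _ in range(30)
]

# Average distortion for K = 1..6; look for the K after which it flattens out.
curve = {k: average_distortion(points, k) for k in range(1, 7)}
for k, d in curve.items():
    print(k, round(d, 3))
```

On data like this the curve should drop sharply up to K = 3 and flatten afterwards. On real data the bend is often much less obvious.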

(Apr 08 '11 at 20:06) ebony

Try something like PG-means or Bayesian k-means, which automatically derive a value for k. It's not always the correct value, but testing on a large number of corpora seemed to give good values most of the time.

(Apr 21 '11 at 00:07) Robert Layton

X-means is another algorithm which can help. See: http://www.cs.cmu.edu/~dpelleg/kmeans.html

(Apr 26 '11 at 17:03) mcenley
Why are you doing clustering? Until you answer that question, you can't get advice more useful than Alexandre Passos's answer below to use whatever it is your clustering algorithm is optimizing.

(Nov 03 '11 at 17:27) gdahl ♦

5 Answers:

The Wikipedia page on cluster analysis has a good section on cluster evaluation that can serve as a starting point for further research. The usual criterion is that in a quality clustering each document has low intra-cluster distance (documents clustered together are similar) and high inter-cluster distance (documents that weren't clustered together are dissimilar).

However, because these measures aren't grounded in an objective quality metric, the best-scoring clustering may not be the most useful one. Thus I'd strongly recommend using a few different quality and distance metrics, then going over the results by hand to see which best matches your intuitive view of what a good cluster would look like.

answered Apr 07 '11 at 11:52

Paul Barba

If you know nothing about the ground truth, the only way you can evaluate a clustering is by comparing it with the modeling assumptions. K-means minimizes the sum of the squared distances between each point and its closest center, so you can measure the quality of a k-means clustering by looking at this number.
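To make that number concrete, here is a minimal sketch that computes the k-means objective (the sum of squared Euclidean distances from each point to its nearest center) for a given set of centers. The function names and the toy data are made up for illustration. Libraries report the same quantity under their own names (scikit-learn's `KMeans` exposes it as `inertia_`).

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_objective(points, centers):
    """Sum over all points of the squared distance to the nearest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


# Toy data (made up for illustration): two tight groups on a line.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good_centers = [(0.5,), (10.5,)]  # one center per group
bad_centers = [(0.0,), (1.0,)]    # both centers in the left group

print(kmeans_objective(points, good_centers))  # 4 * 0.25 = 1.0
print(kmeans_objective(points, bad_centers))   # 0 + 0 + 81 + 100 = 181.0
```

Lower is better for a fixed k, but the value always shrinks as k grows, so it is not comparable across different values of k on its own.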

answered Apr 06 '11 at 12:25

Alexandre Passos ♦

Yes, it's a metric called "partition quality." Alexandre gave an example using K-means. A partition quality here would be the sum of the inverse total variance within each cluster. I've used Fisher's partition quality from CobWeb.

answered Apr 06 '11 at 13:59

Melipone Moody

There are many validity indexes for evaluating a clustering, but there should be some standard reference values available for these indexes, so that one could compare clustering results against them.

This answer is marked "community wiki".

answered Nov 03 '11 at 01:09

sunil gautam

Try the Silhouette Coefficient. It is higher when clusters are separated (i.e. clusters do not overlap) and dense (i.e. each point is very similar to other points within its cluster).
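For reference, a minimal sketch of the mean silhouette coefficient in plain Python, assuming Euclidean distance (it needs Python 3.8+ for `math.dist`). For each point, `a` is the mean distance to the other members of its own cluster, `b` is the mean distance to the members of the nearest other cluster, and the score is `(b - a) / max(a, b)`. The function name and toy data are made up for illustration. In practice a library version such as scikit-learn's `silhouette_score` is likely the better choice.

```python
import math


def silhouette(points, labels):
    """Mean silhouette coefficient using Euclidean distance.

    Needs at least two clusters; points in singleton clusters score 0.
    """
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # a: mean distance to the other members of p's own cluster
        # (p itself contributes a distance of 0 to the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data (made up): two pairs of nearby points, far apart from each other.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(silhouette(points, [0, 0, 1, 1]))  # sensible split: about 0.90
print(silhouette(points, [0, 1, 0, 1]))  # split across the gap: about -0.45
```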

That said, if you do not know how you are going to evaluate your clustering, you have to ask yourself -- why are you clustering at all? If you are trying to find patterns, then you use those patterns as the means to evaluate the clustering. If you are looking to compare against something, you need that ground truth. If you are looking to find natural partitions in the data, you are looking for cluster quality (e.g. using the Silhouette Coefficient).

One final thing I'll say is that you may not know the ground truth, but do you have any external knowledge about the data that you can use? In my eCRS 2010 paper, I used clustering on phishing webpages to determine authorship -- although I had no idea exactly who the authors were. To overcome this, I used external information: the domains the websites were on. I made the assumption that, in most cases at least, two phishing websites on the same domain are likely to be from the same group/author. This allowed me to verify that the clusterings were 'mostly correct', and led to an insight that the clusters I got corresponded to campaigns, but there was strong evidence that some of the clusters needed to be joined (in future research). Keep in mind though -- that was a fairly big assumption, but that is something you need to consider when dealing with unsupervised learning.
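One simple way to turn this kind of external information into a number is to ask what fraction of item pairs that share an external group (here, a domain) also ended up in the same cluster. The sketch below is only an illustration of that idea, not the procedure from the paper. The function name, domains, and cluster assignments are all hypothetical.

```python
from itertools import combinations


def same_group_pair_agreement(cluster_labels, external_groups):
    """Fraction of item pairs sharing an external group that also share a cluster."""
    same_group = 0
    also_same_cluster = 0
    for i, j in combinations(range(len(cluster_labels)), 2):
        if external_groups[i] == external_groups[j]:
            same_group += 1
            if cluster_labels[i] == cluster_labels[j]:
                also_same_cluster += 1
    if same_group == 0:
        return None  # no same-group pairs, so nothing to check
    return also_same_cluster / same_group


# Hypothetical example: the hosting domain of each page and the cluster it got.
domains = ["a.example", "a.example", "a.example", "b.example", "b.example", "c.example"]
clusters = [0, 0, 1, 2, 2, 2]

# 4 same-domain pairs exist; 2 of them share a cluster.
print(same_group_pair_agreement(clusters, domains))  # 0.5
```

A low score only says the clustering disagrees with the assumption, which may itself be wrong. The check also deliberately ignores pairs from different domains, since the same author may use several domains.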

answered Apr 08 '11 at 21:35

Robert Layton

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.