|
The Wikipedia page on cluster analysis has a good section on cluster evaluation that can serve as a starting point for further research. In a good clustering, documents typically have low intra-cluster distance (documents in the same cluster are similar) and high inter-cluster distance (documents in different clusters are dissimilar). However, because these measures aren't grounded in an objective quality metric, the best-scoring clustering may not be the most useful one. I'd therefore strongly recommend using a few different quality and distance metrics, then going over the results by hand to see which best matches your intuitive view of what a good cluster looks like. |
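As a rough illustration of the intra-cluster versus inter-cluster idea, here is a minimal sketch in plain Python. It assumes Euclidean distance and points given as numeric tuples; the function name `mean_intra_inter` and the toy data are made up for this example, and a real document collection would more likely use cosine distance over term vectors.

```python
from itertools import combinations
from math import dist


def mean_intra_inter(points, labels):
    """Return (mean intra-cluster distance, mean inter-cluster distance).

    Every pair of points is counted once: pairs sharing a label contribute
    to the intra-cluster mean, all other pairs to the inter-cluster mean.
    """
    intra, inter = [], []
    for i, j in combinations(range(len(points)), 2):
        d = dist(points[i], points[j])
        if labels[i] == labels[j]:
            intra.append(d)
        else:
            inter.append(d)

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return mean(intra), mean(inter)


# Toy data (hypothetical): two tight pairs that are far from each other.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good_labels = [0, 0, 1, 1]  # groups the nearby points together
bad_labels = [0, 1, 0, 1]   # groups the distant points together

good_intra, good_inter = mean_intra_inter(points, good_labels)
bad_intra, bad_inter = mean_intra_inter(points, bad_labels)
```

With the sensible labeling the intra-cluster mean is much smaller than the inter-cluster mean; with the bad labeling the relationship flips. That is the signal these measures give, but as noted above it still needs a sanity check by hand.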
|
If you know nothing about the ground truth, the only way you can evaluate a clustering is by comparing it with the modeling assumptions. K-means minimizes the sum of the squared distances between each point and its closest center, so you can measure the quality of a k-means clustering by looking at this number. |
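A minimal sketch of that number, assuming Euclidean geometry and points and centers given as numeric tuples (the name `kmeans_objective` and the toy data are illustrative, not from any particular library):

```python
def kmeans_objective(points, centers):
    """Sum, over all points, of the squared Euclidean distance to the nearest center."""
    total = 0.0
    for p in points:
        total += min(
            sum((a - b) ** 2 for a, b in zip(p, c))
            for c in centers
        )
    return total


# Toy data (hypothetical): two groups of two points each.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

# One center per group versus a single center in the middle.
two_centers = kmeans_objective(points, [(0.0, 1.0), (10.0, 1.0)])
one_center = kmeans_objective(points, [(5.0, 1.0)])
```

Lower is better, but only when comparing clusterings with the same number of centers: adding centers can always drive the value down.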
|
Yes, it's a metric called "partition quality." Alexandre gave an example using K-means. A partition quality here would be the sum of the inverse total variance within each cluster. I've used Fisher's partition quality from CobWeb. |
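Read literally, "the sum of the inverse total variance within each cluster" could be sketched as below. This is only an interpretation of that sentence, not COBWEB's own definition: it assumes "total variance" means the sum of the per-dimension population variances, and the name `partition_quality` is made up for the example.

```python
from statistics import pvariance


def partition_quality(points, labels):
    """Sum over clusters of 1 / (total within-cluster variance).

    Total variance is taken here to be the sum of the population variances
    of each coordinate (an assumption of this sketch). A cluster with zero
    variance contributes infinity.
    """
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    quality = 0.0
    for members in clusters.values():
        total_var = sum(pvariance(dim) for dim in zip(*members))
        quality += float("inf") if total_var == 0 else 1.0 / total_var
    return quality


# Toy data (hypothetical): two tight pairs that are far from each other.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
tight = partition_quality(points, [0, 0, 1, 1])
spread = partition_quality(points, [0, 1, 0, 1])
```

Higher values mean tighter clusters. Like the k-means objective, the score rewards splitting the data further, so it is most meaningful when comparing partitions with the same number of clusters.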
|
There are many validity indexes for evaluating a clustering, but are there standard reference values for these indexes that one could compare clustering results against? Thanks, Sunil Gautam (sunilgautam82@gmail.com)
|
|
Try the Silhouette Coefficient. It is higher when clusters are well separated (i.e. clusters do not overlap) and dense (i.e. each point is very similar to the other points in its cluster). That said, if you do not know how you are going to evaluate your clustering, you have to ask yourself: why are you clustering at all? If you are trying to find patterns, then use those patterns as the means to evaluate the clustering. If you are looking to compare against something, you need that ground truth. If you are looking to find natural partitions in the data, you are looking for cluster quality (e.g. the Silhouette Coefficient). One final point: you may not know the ground truth, but do you have any external knowledge about the data that you can use? In my eCRS 2010 paper, I used clustering on phishing webpages to determine authorship, although I had no idea exactly who the authors were. To overcome this, I used external information: the domains the websites were hosted on. I assumed that, in most cases at least, two phishing websites on the same domain are likely to be from the same group or author. This allowed me to verify that the clusterings were 'mostly correct', and led to the insight that the clusters I got corresponded to campaigns, although there was strong evidence that some of the clusters needed to be joined (in future research). Keep in mind that this was a fairly big assumption, but it is the kind of thing you need to consider when dealing with unsupervised learning. |
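For reference, the Silhouette Coefficient of a point is (b - a) / max(a, b), where a is its mean distance to the other points in its own cluster and b is the smallest mean distance to the points of any other cluster; the overall score is the average over all points. A minimal pure-Python sketch, assuming Euclidean distance and scoring singleton clusters as 0 (a common convention); the toy data is made up:

```python
from math import dist


def silhouette_score(points, labels):
    """Mean silhouette over all points; ranges from -1 (bad) to 1 (good)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # singleton cluster: score 0 by convention
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the nearest other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / len(scores)


# Toy data (hypothetical): two tight pairs that are far from each other.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])
bad = silhouette_score(points, [0, 1, 0, 1])
```

The sensible labeling scores close to 1 and the mismatched labeling scores below 0. This version is quadratic in the number of points, so for large collections a library implementation or sampling is the more practical route.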
Also see this question on Cross Validated: http://stats.stackexchange.com/questions/7175/understanding-comparisons-of-clustering-results/7425#7425
Related question: How do you choose K without any ground truth? Though there might be several Ks that are plausible, not every K is plausible.
For something like k-means, one could use the average distortion as a way to select K (look at the "elbow point" on the average distortion vs K curve). For probabilistic models (like GMM), one could use the log-likelihood.
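To make the elbow idea concrete, here is a small self-contained sketch. It assumes plain Lloyd's k-means with a deterministic farthest-first initialization (chosen here only so the example is reproducible, not because the comment prescribes it), and takes "average distortion" to mean the mean squared distance to the nearest center. The function names and the toy data are illustrative.

```python
from math import dist


def kmeans(points, k, iters=50):
    """Lloyd's algorithm with farthest-first initialization; returns the centers."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))

    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centers[i]))
            groups[nearest].append(p)
        # Move each center to the mean of its group (keep it if the group is empty).
        new_centers = [
            tuple(sum(xs) / len(g) for xs in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def average_distortion(points, centers):
    """Mean squared distance from each point to its nearest center."""
    return sum(min(dist(p, c) ** 2 for c in centers) for p in points) / len(points)


# Toy data (hypothetical): three well-separated groups of three points.
points = [
    (0, 0), (0, 1), (1, 0),
    (10, 10), (10, 11), (11, 10),
    (20, 0), (20, 1), (21, 0),
]
curve = {k: average_distortion(points, kmeans(points, k)) for k in range(1, 6)}
for k, value in curve.items():
    print(k, round(value, 3))
```

On this toy data the distortion falls steeply up to K = 3 and only slightly afterwards, which is the "elbow". On real data the bend is usually much less clear, so it is better treated as a hint than as a rule.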
Try something like PG-means or Bayesian k-means, which automatically derive a value for k. It's not always the correct value, but testing on a large number of corpora seemed to give good values most of the time.
X-means is another algorithm that can help. See: http://www.cs.cmu.edu/~dpelleg/kmeans.html
Why are you doing clustering? Until you answer that question, you can't get advice more useful than Alexandre Passos's answer below to use whatever it is your clustering algorithm is optimizing.