Are there any well-known datasets with real-valued feature vectors and classification labels that people tend to use to test clustering? Particularly difficult datasets, where the best accuracy is below 80%, would be nice.
I see a lot of choices in the UCI repository, but I'm not sure which one would be suitable.
Are there any NLP databases that have real valued feature vectors for each sample?
It is debatable whether you can evaluate clustering, pure and simple, using labels alone. While you can easily get performance numbers that way, many serious practitioners and theorists argue that those numbers don't mean much, because it is not a priori true that the given labels are the grouping you actually want. For example, when clustering Amazon reviews, one can be interested in grouping by:
etc. Any clustering algorithm that successfully finds clusters that are good according to one of these criteria will probably perform very poorly when measured against any of the others. I at least have been interested in each of these clusterings at one time or another, and other people can certainly think of further interesting options.
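A tiny sketch makes the point concrete: the exact same clustering, scored against two different "ground truth" labelings, gets completely different numbers. The data here is made up purely for illustration.

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical: eight reviews assigned to two clusters.
clusters = [0, 0, 0, 0, 1, 1, 1, 1]

# Two equally legitimate labelings of the same reviews (invented data):
by_product = [0, 0, 0, 0, 1, 1, 1, 1]    # grouping by product; matches the clustering
by_sentiment = [0, 1, 0, 1, 0, 1, 0, 1]  # grouping by sentiment; orthogonal to it

print(adjusted_rand_score(by_product, clusters))    # 1.0: perfect agreement
print(adjusted_rand_score(by_sentiment, clusters))  # near 0: chance level
```

The clustering is "perfect" or "useless" depending only on which labels you happened to pick as ground truth.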
With this caveat in mind, use any standard dataset close enough to the domain in which you want to claim good performance, and measure performance against the usual gold-standard labels. For clustering text, for example, most people pick some category structure from the Reuters RCV-1 labels as ground truth; for vision, they look at one of the many object-recognition datasets.
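In practice the "measure against gold-standard labels" step usually means an external metric such as ARI or NMI. A minimal sketch on the digits dataset (64-dimensional real-valued vectors with 10 class labels, so it also fits the question's criteria):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# 1797 samples, 64 real-valued features, labels 0-9.
X, y = load_digits(return_X_y=True)

# Cluster with k set to the known number of classes.
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X)

# External evaluation against the gold-standard labels.
print("ARI:", adjusted_rand_score(y, labels))
print("NMI:", normalized_mutual_info_score(y, labels))
```

Both metrics lie in [0, 1] for good clusterings (ARI can go slightly negative), and both are invariant to permutations of the cluster IDs, which is exactly what you need when comparing unsupervised output to labels.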
What I would advise is to read carefully the papers describing whatever baseline technique you are building upon or comparing against, and to use at least their datasets and performance metrics, possibly adding something new that you find relevant. If you care about scientifically sound conclusions, the only way to evaluate is to use clustering as part of some bigger process with a true loss function, and to measure how that process's performance changes when you swap clustering algorithms. Even this, however, makes no claims about how your algorithm will generalize to other end-to-end scenarios.
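One way that end-to-end evaluation can look in code (the downstream task and feature construction here are illustrative assumptions, not a prescribed setup): plug each clustering algorithm into the same pipeline, here as a cluster-distance featurizer feeding a classifier, and compare the task's own score rather than a clustering metric.

```python
from sklearn.cluster import Birch, KMeans
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)

scores = {}
for clusterer in (KMeans(n_clusters=20, n_init=10, random_state=0),
                  Birch(n_clusters=20)):
    # The clusterer acts as a transformer: samples become vectors of
    # distances to cluster centers, which the classifier consumes.
    pipe = make_pipeline(clusterer, StandardScaler(),
                         LogisticRegression(max_iter=1000))
    scores[type(clusterer).__name__] = cross_val_score(pipe, X, y, cv=3).mean()

for name, score in scores.items():
    print(name, round(score, 3))
```

Whichever clusterer yields the better cross-validated accuracy is the better one *for this task*; as the answer notes, that verdict need not transfer to a different end-to-end scenario.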
For a deeper discussion of these topics by some prominent experts in the area, read the opinion paper from the NIPS 2009 workshop on clustering.
Have a look at other clustering-algorithm papers: most use toy data sets like the half rings or the cigar dataset. However, if you are building a clustering algorithm for real-time Internet crawling, those datasets will be next to useless to you.
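Those toy sets are trivial to generate yourself; sklearn's `make_moons` is the usual stand-in for the half rings, and it shows why such sets are popular: k-means predictably fails on the interlocking shapes, while a graph-based method like spectral clustering typically recovers them.

```python
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interlocking half rings ("two moons") with a little noise.
X, y = make_moons(n_samples=400, noise=0.05, random_state=0)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                        random_state=0).fit_predict(X)

print("k-means ARI: ", adjusted_rand_score(y, km))  # typically well below 1
print("spectral ARI:", adjusted_rand_score(y, sc))  # typically near 1
```

Which is exactly the point of the answer: a dataset like this tells you something about shape assumptions, but nothing about how an algorithm behaves on, say, a high-volume crawling workload.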
As with any clustering, you must work out why you are clustering before you can work out how to test it. Too many papers in the area amount to "My algorithm clusters these datasets better than algorithm Y, QED", which says nothing helpful.
answered Apr 24 '11 at 09:19