Are there any well-known datasets with real-valued feature vectors and classification labels that people tend to use to test clustering? Particularly difficult datasets where error rate is below 80% would be nice.

I see a lot of choices in the UCI repository, but I'm not sure which ones would be suitable.

Are there any NLP datasets that have real-valued feature vectors for each sample?

asked Apr 21 '11 at 15:58

crdrn


edited Apr 21 '11 at 16:10

3 Answers:

It is debatable whether you can evaluate clustering, pure and simple, just using labels. While you can easily get performance numbers, most serious practitioners and theorists argue that these numbers don't mean anything, because there is no a priori reason to assume that the grouping you want is the one given by the labels. For example, when clustering Amazon reviews, one can be interested in grouping by:

  1. star rating of the review
  2. perceived helpfulness of the review
  3. positive/negative sentiment of the review
  4. product or product category of the review
  5. author of the review

etc. Any clustering algorithm that finds clusters that are good according to one of these criteria will probably perform very poorly when measured against one of the others. I, at least, have been interested in each of these clusterings in the past, and other people can certainly think of further interesting options.
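To make this concrete, here is a minimal sketch using made-up toy data (the eight "reviews" and the `purity` helper are illustrative assumptions, not taken from any real corpus). It scores one fixed clustering against two different labelings of the same items:

```python
from collections import Counter

def purity(clusters, labels):
    """Fraction of items whose label equals the majority label of their cluster."""
    by_cluster = {}
    for c, y in zip(clusters, labels):
        by_cluster.setdefault(c, []).append(y)
    majority_hits = sum(Counter(ys).most_common(1)[0][1] for ys in by_cluster.values())
    return majority_hits / len(labels)

# Hypothetical toy reviews: (product category, sentiment).
reviews = [
    ("book", "pos"), ("book", "neg"), ("book", "pos"), ("book", "neg"),
    ("camera", "pos"), ("camera", "neg"), ("camera", "pos"), ("camera", "neg"),
]
category = [r[0] for r in reviews]
sentiment = [r[1] for r in reviews]

# A clustering that happens to group the reviews by product category.
clusters = [0, 0, 0, 0, 1, 1, 1, 1]

purity_vs_category = purity(clusters, category)
purity_vs_sentiment = purity(clusters, sentiment)
print(purity_vs_category, purity_vs_sentiment)
```

The same clustering is perfect when judged by category and no better than a coin flip when judged by sentiment, so the number you report depends entirely on which labels you decided to call ground truth.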

With this caveat in mind, use any standard dataset close enough to the domain in which you want to claim good performance, and measure performance against the usual gold-standard labels. For clustering text, for example, most people would pick some category structure from the Reuters RCV-1 labels as ground truth, and for vision they'd look at one of the many object recognition datasets.

What I would advise is to carefully read the papers describing whatever baseline technique you're building upon or comparing against, and to use at least their datasets and performance metrics, possibly adding something new that you find relevant. If you care about scientifically sound conclusions, the only way to evaluate is to use clustering as part of some bigger process with a true loss function, and to measure how the performance of that process changes when you switch clustering algorithms. This, however, makes no claims as to how your algorithm will generalize to other end-to-end scenarios.
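A rough sketch of that kind of end-to-end comparison, under several assumptions: the XOR-style toy data, the bare-bones k-means, and the "label each cluster by majority vote" classifier are all stand-ins chosen to keep the example self-contained, and `downstream_error` is a hypothetical helper name. The two "algorithms" being swapped are simply k-means with k=4 and k=2.

```python
import math
from collections import Counter

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def nearest(p, centers):
    return min(range(len(centers)), key=lambda i: dist(p, centers[i]))

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from its nearest existing center.
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return centers

def downstream_error(clusterer, train, test):
    """Cluster the training inputs, label each cluster by majority vote,
    then return the classification error on held-out data."""
    centers = clusterer([x for x, _ in train])
    votes = [Counter() for _ in centers]
    for x, y in train:
        votes[nearest(x, centers)][y] += 1
    # Clusters with no training points fall back to the overall majority label.
    default = Counter(y for _, y in train).most_common(1)[0][0]
    cluster_label = [v.most_common(1)[0][0] if v else default for v in votes]
    wrong = sum(cluster_label[nearest(x, centers)] != y for x, y in test)
    return wrong / len(test)

# Hypothetical toy data: four tight blobs, two classes arranged like XOR.
blobs = {(0, 0): "a", (4, 4): "a", (0, 4): "b", (4, 0): "b"}
offsets = [(dx, dy) for dx in (-0.3, 0.0, 0.3) for dy in (-0.3, 0.0, 0.3)]
data = [((cx + dx, cy + dy), label)
        for (cx, cy), label in blobs.items() for dx, dy in offsets]
train, test = data[0::2], data[1::2]

err_k4 = downstream_error(lambda xs: kmeans(xs, k=4), train, test)
err_k2 = downstream_error(lambda xs: kmeans(xs, k=2), train, test)
print(f"held-out error: k=4 -> {err_k4:.2f}, k=2 -> {err_k2:.2f}")
```

The quantity being compared is the held-out classification error of the whole pipeline, not any internal clustering score. In a real study the toy classifier would be replaced by whatever the full system actually does after the clustering step.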

For a deeper discussion of these topics by some prominent experts in the area, read the opinion paper from the NIPS 2009 workshop on clustering.

answered Apr 21 '11 at 17:34

Alexandre Passos ♦

edited Apr 21 '11 at 17:38

Well, I'm working on a process that basically takes input data all the way to a final classification (i.e., X -> label). The clustering process I use is sort of a middle step in the overall system.

(Apr 21 '11 at 17:44) crdrn

Then you really should evaluate the clustering algorithm by the change in classification error. It will be a lot more meaningful than any "pure" clustering objective or metric.

(Apr 21 '11 at 17:53) Alexandre Passos ♦

I see what you mean. I think I can try the process on the most popular UCI datasets (Iris, Wine, breast cancer) and measure the classification error on each.

(Apr 21 '11 at 18:12) crdrn

UCI datasets have recently been frowned upon for being very small and unrepresentative. It's better to pick a domain-specific dataset, like CIFAR or MNIST for images or RCV-1 for text, among others.

(Apr 21 '11 at 18:16) Alexandre Passos ♦

Well, the issue with image datasets is that I would have to do some substantial preprocessing to extract specific features, since my method would not work on raw images.

My main argument is that this method can be used to classify feature vectors. Images require a system that's fairly invariant to shifts and scale changes, so my results would really depend more on the preprocessing method than on the actual system.

(Apr 21 '11 at 22:59) crdrn

"Best precision and recall on 20-newsgroups?" discusses the state-of-the-art in clustering and other unsupervised methodologies, evaluating on the 20 Newsgroups textcat corpus.

answered Apr 22 '11 at 16:04

Joseph Turian ♦♦

I've received negative reviews on a clustering paper using 20 Newsgroups due to it being a small and unrealistic dataset, so your mileage may vary.

(Apr 22 '11 at 16:05) Alexandre Passos ♦

Have a look at other clustering algorithm papers: most use datasets like the half rings or the cigar dataset. However, if you are creating a clustering algorithm for real-time Internet crawling, these datasets will be next to useless for you.
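For reference, a half-rings style dataset is easy to generate yourself. The sketch below is one common construction (two interleaved half circles, often called "two moons"); the exact offsets, the noise level, and the `half_rings` function name are assumptions rather than the definition used by any particular paper.

```python
import math
import random

def half_rings(n_per_ring=100, noise=0.05, seed=0):
    """Two interleaved half rings in 2D, with one integer label per ring."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    points, labels = [], []
    steps = max(n_per_ring - 1, 1)
    for i in range(n_per_ring):
        t = math.pi * i / steps  # angle sweeping half a circle
        upper = (math.cos(t), math.sin(t))
        lower = (1.0 - math.cos(t), 0.5 - math.sin(t))
        for label, (x, y) in enumerate((upper, lower)):
            points.append((x + rng.gauss(0, noise), y + rng.gauss(0, noise)))
            labels.append(label)
    return points, labels

points, labels = half_rings()
print(len(points), sorted(set(labels)))
```

Shapes like this are popular because centroid-based methods tend to cut straight across the rings, which makes differences between algorithms easy to see in a plot.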

As with any clustering, you must work out why you are clustering to work out how you are going to test it. Too many papers in the area are simply "My algorithm clusters these datasets better than algorithm Y, QED". That doesn't say anything helpful.

answered Apr 24 '11 at 09:19

Robert Layton

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.