Quoting from an answer at http://metaoptimize.com/qa/questions/5750/idiot-proof-machine-learning-methods, concerning K-means:

How should one decide which preprocessing techniques to apply to a data set before running this clustering algorithm? What characteristics of the data set are critical in deciding whether and how the data should be preprocessed before feeding it to the algorithm?
I don't quite agree with David there. As far as I know from the paper, Coates used whitening, which is very natural. Coates et al. used ZCA whitening instead of vanilla sphering so that one can still look at the filters; this shouldn't affect the clustering, though. K-means has a certain "roundness" assumption: for example, it cannot separate two close parallel lines if you want each line to be its own cluster. If you whiten first, this problem goes away. Three kinds of preprocessing seem plausible to me (a code sketch of all three follows the list):

1. If the axes have very different scales and there is no reason one axis should be more important for clustering than the others, the axes should be normalized. A good way to do that is zero-mean, unit-variance scaling. (Actually the mean doesn't matter for K-means, so you might as well leave it out and only divide by the standard deviation.)

2. If the variance along some axes is very low, unit-variance scaling can amplify noise there, so one might want to scale to the range [0, 1] instead.

3. As I said before, sphering gets rid of the "roundness" problem. In practice this means the main direction of variation becomes as important as any other. You should ask yourself: do I want my clusters to explain variance along the main directions, or do I want clusters to cover variance in all possible directions?

[Edit] In principle one can use any linear or non-linear embedding before running K-means, and that might help for some particular problem. Above I am only speaking very generally. [/Edit]

Hope that is helpful. Cheers, Andy
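For concreteness, a minimal sketch of the three options in Python (NumPy and scikit-learn assumed; the ZCA step is the standard eigendecomposition formulation, not necessarily Coates et al.'s exact code, and the data here is synthetic):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))  # placeholder data, one row per sample

# 1. Zero-mean, unit-variance normalization per axis.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Scaling each axis to [0, 1] (an alternative when variances are tiny).
X_01 = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# 3. ZCA whitening: equalize variance in all directions while staying in
#    the original coordinate system (so learned filters stay viewable).
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
eps = 1e-5  # regularizer for near-zero eigenvalues (assumed value)
W_zca = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
X_zca = Xc @ W_zca

labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X_zca)
```

After whitening, the two-parallel-lines example above becomes separable, because the elongated direction of variation is rescaled to unit variance.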
If "preprocessing" includes "choice of metric", then very important. Hastie et al., Elements of Statistical Learning p. 506:
In particular, image/text search seems to me (no expert) to have highly specialized metrics. Does anyone have examples of successful K-means with the Euclidean metric in > 50 dimensions (success measured by CV or other validation)?

As I tried to say, you can use any kind of embedding (which is basically the same as using another metric) before clustering, and if you do it "right", your results will improve. In computer vision, I think about 90% of papers use K-means, in high dimensions and without any specialized metric: it is used to generate dictionaries for "bag of (visual) words" models by clustering descriptors (see the sketch below). By the way, the paper by Coates et al. can be viewed as a very successful application. These might not count as successful by your definition, though, since the quality of the clustering itself is never measured.
(Aug 03 '11 at 04:32)
Andreas Mueller
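To make the bag-of-visual-words use concrete, a rough sketch (scikit-learn assumed; random vectors stand in for real SIFT-like descriptors):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
descriptors = rng.normal(size=(10000, 128))  # placeholder 128-d descriptors

# Learn a dictionary of "visual words" by clustering descriptors.
kmeans = KMeans(n_clusters=256, n_init=4, random_state=0).fit(descriptors)

# Represent one image as a normalized histogram over visual words.
image_desc = rng.normal(size=(300, 128))     # descriptors from one image
words = kmeans.predict(image_desc)
hist = np.bincount(words, minlength=256).astype(float)
hist /= hist.sum()                           # feature vector for a classifier
```

Note that the clustering quality is never measured directly here; the histogram is just a feature for a downstream classifier, which is the point above.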
Measuring the success of clustering by cross-validation is deceiving. Cross-validation will only tell you whether the clustering is stable, and there are easy ways to hack any clustering algorithm into being more stable at the expense of quality (for example, use a deterministic initialization instead of a random one, or use either far more or far fewer clusters than is ideal); a sketch of such a stability check follows below. Clusterings are used for one of two things: (1) to aid human understanding of a data set, or (2) as part of a larger system. In case (1) you should evaluate by looking at lots of plots, squinting, and perhaps comparing with alternatives; here CV adds nothing, since even an unstable algorithm can regularly discover interesting structure, and it may even be beneficial to have more than one clustering structure to look at. In case (2) you can use whatever metric evaluates the end-to-end system; again, stability is not that great a criterion, and things like distortion or dimensionality-reduction quality might actually be pretty good proxies for end-to-end performance. The key thing in clustering, however, is the similarity metric.
(Aug 03 '11 at 05:01)
Alexandre Passos ♦
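A sketch of the stability check being criticized (scikit-learn assumed; the point is that a high score only certifies stability, not usefulness):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))  # placeholder data

def labels_for_seed(X, seed):
    # One K-means run per random initialization.
    return KMeans(n_clusters=5, n_init=1, random_state=seed).fit_predict(X)

# Compare labelings from independent random restarts; near 1.0 means stable.
scores = [adjusted_rand_score(labels_for_seed(X, s), labels_for_seed(X, s + 100))
          for s in range(5)]
print(np.mean(scores))

# A deterministic initialization (e.g. a fixed random_state for every run)
# would push this score to a perfect 1.0 without making the clusters any
# better, which is exactly the hack described above.
```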
Andreas, what surprises me (non-expert) in Coates & Ng is how similar the results are for quite different methods: "71.4 % far worse ... than 77.9" (Table 1, CIFAR-10). Is 70-odd % good for image classification?
(Aug 03 '11 at 06:25)
denis
How good "good" is depends - like everywhere in machine learning I guess - on the dataset. In CIFAR-10 there are 10 equally distributed (I think) classes, giving a chance accuracy of 10%. But this doesn't really say anything about how hard the dataset is. For example MNIST also has 10 classes but even the most naive classifiers like KNN have 97% accuracy. So 98% is not really good on this dataset. So what "good" means is really dependent on what people can achieve on a dataset at the moment. About the "far worse" part: I think often differences are judged relative to full accuracy. So the difference between 95% and 98% is "bigger" than the difference between 60% and 70%. - I don't know whether this is based on some sound statistical principle or not. Maybe someone else knows? In this spirit 71% is really quite worse than 78%. Other differences in the paper (71.5 vs 72.2 or something) might not be that significant - especially due to the large range of models that was tested.
(Aug 04 '11 at 13:29)
Andreas Mueller
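Making the "relative to the remaining error" reading concrete (my arithmetic, not from the paper):

```python
def relative_error_reduction(acc_low, acc_high):
    # Fraction of the remaining error removed by going from acc_low to acc_high.
    return ((1 - acc_low) - (1 - acc_high)) / (1 - acc_low)

print(relative_error_reduction(0.714, 0.779))  # ~0.23: error drops 28.6% -> 22.1%
print(relative_error_reduction(0.95, 0.98))    # 0.60: error drops 5% -> 2%
```

By this measure, 95% to 98% removes 60% of the remaining error, while 60% to 70% removes only 25%, which matches the intuition that the former difference is "bigger".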
Andreas, re 97% accuracy on MNIST: would you have a link to a library that can back that up, either kNN or SVM? (Allow me to be skeptical; I'm appalled by the crummy tests that some packages come with.) Thanks, and have a nice weekend
(Aug 05 '11 at 09:33)
denis
Denis, you can find a list of results with references here: http://yann.lecun.com/exdb/mnist/ For SVM parameters, have a look at my blog: http://peekaboo-vision.blogspot.com/2010/09/mnist-for-ever.html You can do the kNN yourself with any machine learning library by setting k=3. You don't even need a library: a kNN can be implemented in two lines of Python (or MATLAB, for that matter); a slightly expanded sketch follows below.
(Aug 05 '11 at 09:39)
Andreas Mueller
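In that spirit, a compact kNN in NumPy, expanded a little beyond two lines for readability (X_train, y_train, X_test are placeholders for, e.g., MNIST arrays with integer labels):

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    # Squared Euclidean distance between every test and training point.
    # (For a full MNIST run, compute this in chunks to bound memory.)
    d = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    # Majority vote among the labels of the k nearest training points.
    nearest = y_train[np.argsort(d, axis=1)[:, :k]]
    return np.array([np.bincount(row).argmax() for row in nearest])
```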