If you cluster the MNIST training images into 25 groups with some clustering algorithm, assign each group the label that is most frequent among the training images in it, and then at test time give every test image the label of its nearest group... What would the test set classification error rate be? This basically measures how well the clustering algorithm manages to find the natural classes in the data. Did anybody ever report a result like that? I tried Google but couldn't find it.
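To be concrete, here is a minimal sketch of the procedure I mean. It uses synthetic 2-D Gaussian blobs as a stand-in for MNIST and a plain hand-rolled k-means (all names and data here are made up for illustration; on real MNIST you would use the 60k/10k train/test split and pixel vectors instead):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for MNIST: three well-separated Gaussian "digit classes" in 2-D.
# On real MNIST you would use the 60k training / 10k test images instead.
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
X_train = np.vstack([c + rng.normal(size=(100, 2)) for c in centers])
y_train = np.repeat(np.arange(3), 100)
X_test = np.vstack([c + rng.normal(size=(50, 2)) for c in centers])
y_test = np.repeat(np.arange(3), 50)

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns the k centroids."""
    r = np.random.default_rng(seed)
    cent = X[r.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((X[:, None] - cent[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            pts = X[assign == j]
            if len(pts):
                cent[j] = pts.mean(axis=0)
    return cent

k = 25  # 25 clusters, as in the question
cent = kmeans(X_train, k)

# Give each cluster the most frequent training label among its members.
train_assign = np.argmin(((X_train[:, None] - cent[None]) ** 2).sum(-1), axis=1)
cluster_label = np.zeros(k, dtype=int)
for j in range(k):
    members = y_train[train_assign == j]
    if len(members):
        cluster_label[j] = np.bincount(members).argmax()

# Test error: each test image gets its nearest cluster's label.
test_assign = np.argmin(((X_test[:, None] - cent[None]) ** 2).sum(-1), axis=1)
error = (cluster_label[test_assign] != y_test).mean()
print(f"test error rate: {error:.3f}")
```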

Tijmen

asked Sep 06 '12 at 17:52


Tijmen Tieleman


2 Answers:

As to what the test set error would be, this would largely depend on how well the specific clustering algorithm measures similarity in a sense relevant for the classes. Obviously this will vary widely across different domains and datasets. Asymptotically, as the number of clusters increases, you would expect the setup you've described to perform similarly to a nearest neighbor algorithm.
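That asymptotic point can be made concrete: in the degenerate limit where the number of clusters equals the number of training points, each cluster's majority label is just that one point's label, and the classifier is exactly 1-nearest-neighbor. A small numpy sketch (synthetic data, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X_train = rng.normal(size=(30, 2))
y_train = rng.integers(0, 3, size=30)
X_test = rng.normal(size=(10, 2))

# Degenerate case: one cluster per training point. Each cluster's majority
# label is just that point's own label, so the cluster-based classifier
# collapses to 1-nearest-neighbor.
centroids = X_train
cluster_labels = y_train
assign = np.argmin(((X_test[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
pred_cluster = cluster_labels[assign]

# Explicit 1-NN prediction for comparison.
nn = np.argmin(((X_test[:, None] - X_train[None]) ** 2).sum(-1), axis=1)
pred_1nn = y_train[nn]
print((pred_cluster == pred_1nn).all())  # the two rules agree exactly
```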

When people use labeled data to evaluate clustering algorithms it is more common to use metrics such as cluster purity or entropy. This book chapter provides a decent overview (section 8.5.7 covers supervised clustering metrics).

answered Sep 07 '12 at 12:58


alto


I think Tijmen is looking for a number or a reference to a clustering result on MNIST, not an intro to clustering. Presumably he will just have to try what he suggests himself if no one has tried it before.

(Sep 07 '12 at 15:23) gdahl ♦

Since this cluster-then-label scheme is very similar to k-NN but generally slightly inferior, I would imagine the results would be slightly worse than the k-NN results in the MNIST benchmarks described here.

answered Sep 08 '12 at 00:47


Joseph Turian ♦♦


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.