|
I guess the title of the question says it all: how can I find the best K for K-Means run under Map/Reduce? In a single-machine, single-process environment I could use k-fold cross-validation or something similar, but I am afraid such methods might not scale to multi-node Hadoop MapReduce platforms. Any ideas? |
|
I don't think this has anything to do with MapReduce. Every method I know for finding K in K-means runs the clustering for a range of candidate values of k, and those runs are independent of each other, so they are embarrassingly parallel. You then either look for an elbow in the distortion vs. k curve, or evaluate the clusters in an application-specific way. If you wanted to parallelize some strongly serial algorithm, MR might be difficult to apply. But the simple solution here is already non-sequential, so the MR framework doesn't get in your way.

Today I came across an article saying that Canopy Clustering is run before K-Means, in MR fashion. Canopy apparently needs only a few iterations but gives a good number of clusters. That is then fed into K-Means, which does the rest. What do you think about this?
(Feb 25 '13 at 12:09)
Stat Q
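To make the canopy idea from the comment concrete, here is a rough single-machine sketch, not the implementation from any particular MR library. The function name `canopies` and the thresholds `t1` (loose) and `t2` (tight) are my own illustrative choices. It also takes the first remaining point as each canopy center to stay deterministic, where a random point is the more usual choice.

```python
import math


def canopies(points, t1, t2):
    """Group points into canopies using a loose threshold t1 and a tight threshold t2.

    Returns a list of (center, members) pairs. Points within t1 of a center join
    its canopy; points within t2 are removed from further consideration.
    """
    if not 0 < t2 < t1:
        raise ValueError("expected 0 < t2 < t1")
    remaining = list(points)
    result = []
    while remaining:
        # Illustrative choice: take the first remaining point as the center.
        center = remaining[0]
        members = [p for p in remaining if math.dist(p, center) < t1]
        result.append((center, members))
        # Points tightly bound to this center cannot start another canopy.
        remaining = [p for p in remaining if math.dist(p, center) >= t2]
    return result


def k_and_seeds(points, t1, t2):
    """Use the canopy count as k and the canopy centers as initial K-Means centers."""
    found = canopies(points, t1, t2)
    return len(found), [center for center, _ in found]
```

Note that the number of canopies depends on `t1` and `t2`, so this moves the question from choosing k to choosing the two thresholds.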
|
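The point in the answer about independent runs per k could look roughly like the sketch below. It is a toy in plain Python, with a thread pool standing in for whatever actually dispatches the jobs (on Hadoop, each k would presumably be its own job). The helper names, the farthest-first seeding (chosen only to keep the result deterministic), and the second-difference elbow rule are all my own assumptions, not something prescribed in this thread.

```python
from concurrent.futures import ThreadPoolExecutor


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first(points, k):
    """Deterministic seeding: start at points[0], then keep adding the point
    farthest from the centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers


def distortion(points, k, max_iters=100):
    """Run Lloyd's k-means for one k and return the total within-cluster squared distance."""
    centers = farthest_first(points, k)
    for _ in range(max_iters):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its members; keep empty clusters in place.
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if updated == centers:
            break
        centers = updated
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def distortion_curve(points, ks):
    """Evaluate every candidate k independently; no run depends on another."""
    ks = list(ks)
    with ThreadPoolExecutor() as pool:
        return dict(zip(ks, pool.map(lambda k: distortion(points, k), ks)))


def pick_elbow(curve):
    """Pick the k where the curve bends most (largest second difference).
    Assumes evenly spaced candidate values of k and at least three of them."""
    ks = sorted(curve)
    bends = {
        ks[i]: curve[ks[i - 1]] - 2 * curve[ks[i]] + curve[ks[i + 1]]
        for i in range(1, len(ks) - 1)
    }
    return max(bends, key=bends.get)
```

Something like `pick_elbow(distortion_curve(points, range(1, 11)))` would then give a candidate k. The elbow rule is only a heuristic, so it is worth eyeballing the curve as well.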