The title says it all: how can I find the best K for MapReduce'd K-Means? In a single-machine, single-process environment I could use something like k-fold cross-validation, but I am afraid such methods might not scale on a multi-node MapReduce/Hadoop platform.

Any ideas?

asked Feb 25 '13 at 06:35

Stat Q

One Answer:

I don't think this has anything to do with MapReduce. Every method I know for finding K in K-means runs the clustering for a range of candidate values of k, and those runs are independent of each other, so they are embarrassingly parallel. You then either look for an elbow in the distortion vs. k curve, or evaluate the clusters in an application-specific way. If there were some very serial algorithm that you wanted to parallelize, MR might be difficult to apply. But the simple solution here is already a non-sequential one, so the MR framework doesn't get in your way.
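To make the "many independent values of k" idea concrete, here is a minimal single-machine sketch in plain Python. It is only an illustration, not something from the answer above: the function names (`kmeans_distortion`, `distortion_curve`, `elbow`) are made up for this example, the k-means is a bare-bones Lloyd's loop, and the "largest bend" rule is just one crude way to locate an elbow. On a cluster, each `(k, seed)` run would be its own job rather than a loop iteration.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_distortion(points, k, iters=20, seed=0):
    """Run plain Lloyd's k-means once and return the final distortion."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k of the data points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    # Distortion: total squared distance from each point to its nearest centroid.
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

def distortion_curve(points, ks, restarts=5):
    """Map each candidate k to the best distortion over a few random restarts."""
    # Every (k, seed) run is independent of the others, which is what
    # makes this step easy to spread across workers.
    return {k: min(kmeans_distortion(points, k, seed=s) for s in range(restarts))
            for k in ks}

def elbow(curve):
    """Crude heuristic (an assumption of this sketch): pick the k where the
    drop in distortion slows down the most. Needs at least three values of k."""
    ks = sorted(curve)
    bends = {k: (curve[a] - curve[k]) - (curve[k] - curve[b])
             for a, k, b in zip(ks, ks[1:], ks[2:])}
    return max(bends, key=bends.get)
```

Usage would be something like `curve = distortion_curve(list_of_tuples, range(1, 11))` followed by `elbow(curve)`. In practice you would probably still plot the curve and look at it rather than trust the heuristic alone.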

answered Feb 25 '13 at 11:15

Rob Renaud

Today I came across an article saying that Canopy Clustering can be run before K-Means, in MR fashion. Canopy apparently needs only a few iterations, but gives a good estimate of the number of clusters. Its output is then fed into K-Means, which does the rest. What do you think about this?

(Feb 25 '13 at 12:09) Stat Q
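For readers unfamiliar with the canopy idea mentioned in the comment, here is a rough sketch of how it is usually described: a cheap pass over the data with a loose threshold and a tight threshold, where the number of canopies is taken as an estimate of K and the canopy centers can seed K-Means. The names `canopy_centers`, `t1` and `t2` are chosen for this example, and this is a single-machine illustration rather than the MapReduce version the article presumably discusses. Note that it does not remove the tuning problem entirely: the result depends on the two thresholds, which you still have to pick.

```python
def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def canopy_centers(points, t1, t2, dist=euclidean):
    """Return a list of (center, members) canopies.

    t1 is the loose threshold (canopy membership) and t2 the tight one
    (points this close to a center stop being candidate centers).
    """
    if not t1 > t2:
        raise ValueError("expected t1 > t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)  # take any remaining point as a new center
        # Everything within the loose threshold joins this canopy.
        members = [center] + [p for p in remaining if dist(center, p) < t1]
        canopies.append((center, members))
        # Points within the tight threshold are dropped from the candidate
        # list; points between t2 and t1 may still join or start other canopies.
        remaining = [p for p in remaining if dist(center, p) >= t2]
    return canopies
```

`len(canopy_centers(points, t1, t2))` then gives a candidate K, and the centers can be passed to K-Means as initial centroids.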

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.