I'm working with a dataset of about 30 million samples. I've trained an autoencoder and I'd like to apply k-means clustering to it, but hurling the whole training set at it would be incredibly slow.

With that said, I have a few questions here:

  • Are there good approaches to training a model using the kernel trick that keep the problem tractable?
  • Is it possible to apply any of those tricks (assuming they exist) to a black-box implementation? (I'd prefer to avoid rolling my own, since I'm not well versed in the k-means algorithm.)

Any input you can provide (even "You're nuts" is fine, as long as you back it up) would be greatly appreciated.

asked Apr 10 '12 at 01:39

Brian Vandenberg

edited Apr 10 '12 at 01:41

For reference, so far I've tried the obvious thing: break the set up into a bunch of mini-batches, then repeatedly choose a batch (at random, with replacement) and run k-means on it starting from the previous set of centroids, removing empty or duplicate centroids. This seems to work reasonably well. I suppose what I'm asking is whether there is a better way, or whether this will be 'good enough'.

(Apr 10 '12 at 01:58) Brian Vandenberg
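
The mini-batch scheme described in the comment above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the poster's actual code: the names `sq_dist`, `lloyd_step`, and `batched_kmeans` are hypothetical, one Lloyd pass per batch is an assumption (the comment does not say how many passes were run), and a real run over 30 million samples would want array code rather than Python lists.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def lloyd_step(batch, centroids):
    """One k-means pass over a batch, starting from the given centroids.

    Centroids that attract no points are dropped, and so are duplicates.
    """
    dim = len(batch[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for p in batch:
        # Assign the point to its nearest centroid.
        j = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
        counts[j] += 1
        for d, x in enumerate(p):
            sums[j][d] += x
    # Keep only non-empty clusters and replace each by the mean of its points.
    updated = [tuple(x / n for x in row) for row, n in zip(sums, counts) if n > 0]
    # Remove exact duplicates while preserving order.
    return list(dict.fromkeys(updated))

def batched_kmeans(batches, k, rounds=50, seed=0):
    """Repeatedly pick a random batch and refine the previous centroids on it."""
    rng = random.Random(seed)
    centroids = [tuple(p) for p in rng.sample(batches[0], k)]
    for _ in range(rounds):
        batch = rng.choice(batches)  # the same batch may be chosen again later
        centroids = lloyd_step(batch, centroids)
    return centroids
```

Because empty and duplicate centroids are dropped and never re-seeded, the function can return fewer than `k` centroids; whether that is acceptable depends on the application.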

Do you mean kernel k-means, or just vanilla k-means?

Vanilla k-means can be pretty fast if distributed; I've managed to code up a parallel implementation that could cluster ~1m data points on my laptop in less than an hour. With MapReduce, for example, you can implement the E step (finding the closest center for each point) as the map and the M step (updating the centers) as the reduce. Of course, you'd partially reduce while doing the E step, to avoid flooding the network.

(Apr 10 '12 at 07:13) Alexandre Passos ♦
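
To make the map/reduce split in the comment above concrete, here is a small single-process sketch. It is an assumption-laden illustration, not the commenter's implementation: `map_shard`, `merge`, and `kmeans_iteration` are hypothetical names, the data is assumed to be pre-split into shards, and the built-in `map` stands in for whatever parallel or distributed map is actually used. Each shard returns only per-center sums and counts, which is the partial reduction that keeps the shuffled data small.

```python
from functools import reduce

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def map_shard(shard, centers):
    """E step on one shard, combined locally.

    Returns {center index: (coordinate sums, point count)} for this shard.
    """
    partial = {}
    for p in shard:
        j = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
        s, n = partial.get(j, ([0.0] * len(p), 0))
        partial[j] = ([a + b for a, b in zip(s, p)], n + 1)
    return partial

def merge(a, b):
    """Reduce step: add the per-center sums and counts of two partial results."""
    out = dict(a)
    for j, (s, n) in b.items():
        if j in out:
            s0, n0 = out[j]
            out[j] = ([x + y for x, y in zip(s0, s)], n0 + n)
        else:
            out[j] = (s, n)
    return out

def kmeans_iteration(shards, centers):
    """One full k-means iteration over all shards."""
    partials = map(lambda shard: map_shard(shard, centers), shards)
    totals = reduce(merge, partials, {})
    # M step: new center = mean of assigned points.
    # A center that received no points keeps its old position.
    return [
        tuple(x / totals[j][1] for x in totals[j][0]) if j in totals else tuple(centers[j])
        for j in range(len(centers))
    ]
```

Calling `kmeans_iteration` in a loop until the centers stop moving gives ordinary k-means; only the `map` call needs to change to spread the work across cores or machines.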

That's a good question. I was under the clearly mistaken assumption that k-means is a kernel-based method. I'll have to think this through some more. Thank you, Alex.

(Apr 10 '12 at 10:11) Brian Vandenberg

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.