I'm working with a dataset of about 30M samples. I've trained an autoencoder, and I'd like to apply k-means clustering to its output, but it would be incredibly slow to just hurl the whole training set at it. With that said, I have a few questions here:
Any input you can provide (even "You're nuts" is fine, as long as you back it up) would be greatly appreciated.
For reference, so far I've tried the obvious thing: break the set up into a bunch of mini-batches, repeatedly choose a batch at random (with replacement), and re-use the previous set of centroids as the initialization, removing empty or duplicate centroids. This seems to work reasonably well; I suppose what I'm asking is whether there is a better way, or if this will be 'good enough'.
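Roughly, the procedure looks like this (a simplified NumPy sketch rather than my actual code; the function name, batch size, and iteration count are all just for illustration):

```python
import numpy as np

def minibatch_kmeans(data, k, batch_size=4096, n_iters=100, seed=0):
    # Illustrative sketch of the scheme described above: sample batches
    # with replacement, warm-start from the previous centroids, and drop
    # empty or duplicate centroids after each update.
    rng = np.random.default_rng(seed)
    # Seed the centroids from a random sample of the data.
    centroids = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        batch = data[rng.choice(len(data), size=batch_size, replace=True)]
        # E step on the batch: assign each point to its nearest centroid.
        dists = np.linalg.norm(batch[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # M step: recompute each centroid from its batch members,
        # dropping any centroid that received no points.
        centroids = np.stack([batch[labels == j].mean(axis=0)
                              for j in range(len(centroids))
                              if (labels == j).any()])
        # Drop (near-)duplicate centroids; rounding guards against float noise.
        centroids = np.unique(np.round(centroids, decimals=6), axis=0)
    return centroids
```

For what it's worth, scikit-learn's `MiniBatchKMeans` implements a refined version of the same idea (Sculley's web-scale k-means), so that might be a useful baseline to compare against.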
Do you mean kernel k-means, or just vanilla k-means?
Vanilla k-means can be pretty fast if distributed; I've managed to code up a parallel implementation that clustered ~1M data points on my laptop in under an hour. With MapReduce, for example, you can implement the E step (finding the closest center) as the map and the M step (updating the centers) as the reduce (of course, you'd partially reduce during the E step to avoid bloating the network).
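Not my actual implementation, but the split looks roughly like this, with Python's `multiprocessing` standing in for MapReduce (all names and parameters here are illustrative). The key point is the local combine: each worker emits per-cluster partial sums and counts, so only k small arrays cross process boundaries instead of one label per point.

```python
import numpy as np
from multiprocessing import Pool

def map_with_combine(args):
    # "Map" side with a local combine: assign one chunk's points to their
    # nearest centers, then emit per-cluster partial sums and counts.
    chunk, centers = args
    dists = ((chunk[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    k, d = centers.shape
    sums = np.zeros((k, d))
    counts = np.zeros(k, dtype=np.int64)
    for j in range(k):
        mask = labels == j
        counts[j] = mask.sum()
        if counts[j]:
            sums[j] = chunk[mask].sum(axis=0)
    return sums, counts

def parallel_kmeans(data, k, n_iters=20, n_workers=4, seed=0):
    # Call this from under an `if __name__ == "__main__":` guard so the
    # worker processes can spawn cleanly on all platforms.
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    chunks = np.array_split(data, n_workers)
    with Pool(n_workers) as pool:
        for _ in range(n_iters):
            # Re-shipping the chunks each iteration is wasteful; a real
            # MapReduce job would keep the partitions resident on workers.
            partials = pool.map(map_with_combine,
                                [(c, centers) for c in chunks])
            # "Reduce": merge the partial sums/counts and update centers,
            # leaving empty clusters' centers where they were.
            sums = sum(p[0] for p in partials)
            counts = sum(p[1] for p in partials)
            nonempty = counts > 0
            centers[nonempty] = sums[nonempty] / counts[nonempty, None]
    return centers
```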
That's a good question; I was under the (clearly mistaken) assumption that k-means is a kernel-based method. I'll have to think this through some more. Thank you, Alex.