1
1

Is anyone aware of an incremental K-Means clustering algorithm (preferably open source)? I've found a few academic papers on the subject, but I haven't been able to find any code.

asked Apr 25 '11 at 14:50

Cerin's gravatar image

Cerin
402253744


4 Answers:

There is an undergoing implementation (not sure how stable) at the scikits.learn.

Edit (by OG): Here a link to the PDF of the paper Web Scale K-Means Clustering - very simple to read. The scikit-learn branch does not implement the optimized sparse representation, just the mini-batch part.

Edit (by OG): the branch has been merged to master. This will be part of the 0.9 release.

answered Apr 25 '11 at 15:27

Alexandre%20Passos's gravatar image

Alexandre Passos ♦
1896744214334

edited May 19 '11 at 18:45

ogrisel's gravatar image

ogrisel
398464480

Yes, there is a pull request at https://github.com/scikit-learn/scikit-learn/pull/132 . Comments on the pull request or help with the code are most welcome.

(Apr 25 '11 at 16:02) Gael Varoquaux

Isn't "batch k-means" the exact opposite of incremental/online k-means?

(Apr 25 '11 at 17:22) Cerin

@Cerin technically yes, but the implementation uses mini-batches, and it's incremental if you use a batcksize of one.

(Apr 25 '11 at 17:25) Alexandre Passos ♦

@Cerin the name of the branch is wrong, but the implementation is right.

(Apr 26 '11 at 03:34) ogrisel

Should this be able to support incremental addition and removal of points? Looking through the code in scikits.learn, it only seems to support adding new points. Removing points requires rebuilding the clusters from scratch.

(May 03 '11 at 17:36) Cerin

Just to followup, although there's no removal function, the incremental addition seems to work very well.

(May 18 '11 at 10:18) Cerin
showing 5 of 6 show all

A very simple algorithm for online k-means is here, the code for which can be hacked up quite easily.

The algorithm does require you to initialize the K cluster centroids. One way to initialize them would be with the first K examples in the stream. You can then update the centroids using the subsequent examples in the stream, as described in the algorithm.

answered Apr 25 '11 at 18:55

spinxl39's gravatar image

spinxl39
3458104368

edited Apr 26 '11 at 02:45

Joseph%20Turian's gravatar image

Joseph Turian ♦♦
467541105126

there are dozens of incremental/online kmeans code that I can list here, vfml has a C version. Which language are you seeking the code for?

answered May 20 '11 at 10:20

cglr's gravatar image

cglr
1954811

Preferably Python. I looked at VFML. Unfortunately, it hasn't been maintained in several years, and doesn't compile on most modern Linux distributions. Also, please don't double post.

(May 20 '11 at 11:19) Cerin

sorry no idea why it was double posted :). About a year ago, I compiled and used it on 64 bit ubuntu 10.04 (I don't remember well but probably I may have changed the Makefile). But I didn't test on natty.

(May 20 '11 at 11:59) cglr

Here they have an implementation applied to MRI images, the code seems to be pure matlab, which probably means is really slow, but on the upside, really easy to implement yourself.

answered Apr 26 '11 at 00:53

Leon%20Palafox's gravatar image

Leon Palafox
31265471107

Your answer
toggle preview

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.