|
In my application, I am performing batch kmeans clusters on a bunch of sift descriptors. There is no limit to how many times that I would perform these batches, nor a limit on the number of sift descriptors I have to batch. Simply summing all the clusters from each batch is inefficient because there are bound to be clusters with very similar data, and we quickly get too many redundant clusters. Initially I looked into data streaming and incremental clusterings, but those didn't seem to apply to me, since from my understanding both methods require that I have all the data on hand. But in my case, the sift descriptors comes in an online application and will only be received one at a time, there is no end to this process so I need the ability to constantly update my clusters or create/combine new clusters. Ideally it would be good to be able to merge clusters with older clusters. Is this possible? Or are there any other approaches I could take? EDIT: a better way to word this, is it possible to do kmeans clustering to cluster an existing set of clusters and new unclustered data? |
|
Your problem sounds very similar to that of clustering multiple datasets while wanting the clusters found in each dataset to be connected. Have you read Revisiting k-means: New Algorithms via Bayesian Nonparametrics? Section 4 on Clustering with Multiple Data sets sounds like it might solve your problem. The basic idea revolves around the Dirichlet Process, which can converge to K-Means under certain seetings. Using this idea, you maintain two sets of clusters
For each batch of data, you assign each data point to a local cluster as long as the smallest distance is below the threshold thanks for the paper. I just want to clear up a confusion I'm having. Would sequential kmeans work for my case? It allows for acquiring data over time, but I'm not sure if its restricted to me knowing beforehand the size of the dataset. The biggest issue I've had with similar approaches is that they never seem to allow me to go back to a completed cluster and add more to it, regardless of if it lets me add to a cluster bit by bit. Am I just not understanding these concepts right?
(Aug 13 '12 at 13:55)
mugetsu
I'm not sure what you mean when you refer to "Sequential K-Means." Are you referring to the standard k-means algorithm? The one I suggested? Or a streaming approach like Streaming K-Means? For vanilla K-Means, your issue would be knowing how many clusters to create. The Dirichlet Process inspired algorithm I noted above side steps this issue and simply creates new clusters for each dataset and then associates each cluster with global clusters (which also get created as needed). Every time you associate, or break the association, with a local cluster to a global cluster, you update the mean of the global cluster to reflect where those data points go.
(Aug 14 '12 at 06:35)
Keith Stevens
this is where I got the term sequential: http://www.cs.princeton.edu/courses/archive/fall08/cos436/Duda/C/sk_means.htm on another note, I was able to get the implementation from the author
(Aug 14 '12 at 12:19)
mugetsu
|