I am working on an online image feature recognition program (BOW histograms) that captures objects from a live cam and extracts SIFT features. After collecting a batch of pictures, I run k-means on the batch and create a global set of clusters from it. Then I repeat the whole step: grab more pictures and build another, separate batch of clusters. Ideally I would like to limit the total number of clusters and cluster all images at once, but that is impossible, since there is an endless stream of new data coming in. I'm not quite sure how to approach this. Here is how my code works right now:
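Roughly, in Python-style pseudocode (the camera/SIFT helpers here are just placeholders for my own code, and the batch/cluster sizes are made up):

```python
import numpy as np
from sklearn.cluster import KMeans

global_centers = []  # one (k, 128) array of SIFT cluster centers per batch

while True:
    # grab a batch of frames from the live cam (placeholder helper)
    frames = grab_frames(n=50)

    # extract SIFT descriptors from every frame and stack them into (N, 128)
    descriptors = np.vstack([extract_sift(f) for f in frames])

    # cluster this batch on its own and keep its centers
    km = KMeans(n_clusters=200).fit(descriptors)
    global_centers.append(km.cluster_centers_)

    # the BOW vocabulary is just every batch's centers concatenated,
    # so it grows by another 200 "words" with each batch
    vocabulary = np.vstack(global_centers)
```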
You can see that this is bad: I quickly end up with too many clusters, and each batch of clusters will inevitably overlap with the others. What I want:
Nothing that I've seen can accommodate that, unless I'm just not understanding it correctly. Many implementations, such as data-streaming ones, seemingly only let you build the clusters in one pass; once all the data has been streamed, it doesn't seem easy to update the existing clusters. I would assume this should be possible, but none of the code I've found backs that up. I would appreciate any help clearing this up. Thanks!

Update: I'm starting to look into the Dirichlet process. Here is something that looks like it could work: http://www.cc.gatech.edu/~jeisenst/software.html#dpmm. What is the advantage of k-means over this, though? K-means seems to be the de facto standard in the BOW methodology.
I think you want an algorithm that looks roughly like this:
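In Python-ish pseudocode (the Euclidean distance and the value of merge_threshold are just placeholders; you'd tune both):

```python
import numpy as np
from sklearn.cluster import KMeans

global_means = []   # global cluster centers, built up across batches
global_counts = []  # how many local means have been folded into each global mean

def process_batch(descriptors, k=200, merge_threshold=0.5):
    """Cluster one batch locally, then fold its means into the global set."""
    local_means = KMeans(n_clusters=k).fit(descriptors).cluster_centers_

    for m in local_means:
        if global_means:
            # distance from this local mean to every existing global mean
            dists = np.linalg.norm(np.asarray(global_means) - m, axis=1)
            j = int(np.argmin(dists))
            if dists[j] < merge_threshold:
                # similar enough: merge into the best global mean (running average)
                n = global_counts[j]
                global_means[j] = (global_means[j] * n + m) / (n + 1)
                global_counts[j] += 1
                continue
        # no global means yet, or nothing close enough: start a new global mean
        global_means.append(m.copy())
        global_counts.append(1)
```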
This lets you use any batch algorithm you want for the small batches you're gathering, and then try to associate the local means found for each batch with the global means. The global means are created on the fly as they are needed. At first there are no global means, so (if k-means did its job right) all the local means from the first batch likely become global means. For the second batch there will likely be some clusters whose local means already existed in the previous data, so you compare each local mean to the existing global means, find the most similar one, and fold the local mean into that best global mean as long as their similarity is above some threshold. Setting the threshold is the hard part; you'll likely have to experiment a bit to find one that works well. But with a good threshold this works much better than just concatenating the lists of local means, and over time you should see fewer and fewer new global means added, as long as new distributions of features aren't continually being introduced into the dataset.

I actually am doing something somewhat similar right now. Instead of a straight concat, I use vl_ubcmatch to get the squared Euclidean distance between each local mean and the global means, and given a set threshold, similar clusters are simply removed. It does help a little, but I still get a pretty huge vocabulary after a few hundred images. Maybe your method will work a little better, but I would really rather the clusters rearrange themselves instead of being forced together by similarity. I think that would be more robust.
(Aug 17 '12 at 11:58)
mugetsu
Some solutions come to mind:
I can't go with 1, since I expect some clusters to be merged and new clusters to form. That means only 2 and 3 are viable. The problem I've had with online k-means is that I have had a really hard time finding anything about it. Would online k-means work better than a Dirichlet process?
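For reference, the little I've pieced together suggests the basic sequential (MacQueen-style) k-means update per descriptor is roughly this, with the number of clusters fixed up front (just my rough sketch, so correct me if that's not what is usually meant by online k-means):

```python
import numpy as np

def sequential_kmeans_step(centers, counts, x):
    """One online k-means update for a single new descriptor x.

    centers: (k, d) array of current cluster centers (updated in place)
    counts:  (k,)  array of how many points each center has absorbed
    """
    # assign x to its nearest center
    j = int(np.argmin(np.linalg.norm(centers - x, axis=1)))
    counts[j] += 1
    # nudge that center toward x with a shrinking step size of 1/count
    centers[j] += (x - centers[j]) / counts[j]
    return j
```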
(Aug 17 '12 at 11:52)
mugetsu