I've been doing a lot of research on how to achieve what I initially thought was fairly straightforward, but after looking around it seems the readily available solutions can only do part of what I need. I'm beginning to think it's impossible to implement easily.

So what I want to do is implement an algorithm that allows for online training of unknown objects. I would use a saliency map to find the most important object in the scene and, if it is undefined, train on that object. We then loop through this process many times with as many objects as possible. Finally, after learning every important object in the scene, the algorithm proceeds to track and recognize all of those objects.

At this point I do have methods to accomplish the above. But once I add the requirement of training thousands of objects or more, many of the methods I've found run too slowly. The only tools I have are OpenCV functions and some saliency-map extensions. What can I do?

asked Oct 19 '11 at 13:53

mugetsu


One Answer:

If you have that many objects, I think you should look at the image retrieval literature. Try Cordelia Schmid as a starting point, or maybe even better, Florent Perronnin, but at the moment I can't find the papers I'm looking for :( The basic ideas are to compress descriptors and to use inverted file systems to quickly find the corresponding object. There was a cool demo online but I can't find that either :( Maybe start at the "video google" paper...
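The inverted-file idea can be sketched in a few lines: quantize each local descriptor into a "visual word" ID, keep a map from word ID to the objects that contain it, and answer a query by letting its words vote. A toy Python sketch (the word IDs are assumed to already come from some codebook, which is not shown here):

```python
from collections import defaultdict, Counter

class InvertedIndex:
    """Toy inverted file: visual word id -> set of object ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, object_id, words):
        # Index every visual word that occurs in this object.
        for w in words:
            self.postings[w].add(object_id)

    def query(self, words, top_k=3):
        # Vote: each query word votes for every object containing it.
        votes = Counter()
        for w in words:
            for obj in self.postings.get(w, ()):
                votes[obj] += 1
        return votes.most_common(top_k)

index = InvertedIndex()
index.add("cup",   [3, 17, 17, 42])
index.add("phone", [5, 17, 99])
index.add("book",  [1, 2, 3])

# A query sharing words 17, 42 and 3 should rank "cup" first.
print(index.query([17, 42, 3]))
```

Real systems add TF-IDF weighting on top of the raw vote counts, but the lookup cost is the point: it only touches the postings lists of the query's words, not every stored object.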

This answer is marked "community wiki".

answered Oct 19 '11 at 15:11

Andreas Mueller

edited Oct 19 '11 at 15:39

Thanks for the suggestion; the descriptors and inverted file system sound great. What was the online demo displaying? Something close to what I described, or just a demo of many-object recognition (offline training, etc.)?

Has anyone done what I described?

Right now I'm thinking of using the following algorithms:

the iLab C++ toolkit for the saliency map, and HOG descriptors for object recognition. While online:

  1. saliency map finds most interesting objects

  2. determine the descriptors of these objects

  3. inverted file system lookup, if found, identify object

  4. else remember object in file system

  5. loop

Not quite sure if HOG is appropriate here; maybe this is too simple?
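The five steps above can be sketched as a skeleton loop. Everything here is a hypothetical stand-in: `most_salient_region` and `describe` are stubs for the iLab saliency step and the descriptor computation, and a plain dict with exact-match lookup stands in for the inverted file (a real system would do approximate nearest-neighbor matching instead):

```python
def most_salient_region(frame):
    # Stub for step 1 (iLab saliency would go here): pick the "strongest" patch.
    return max(frame, key=lambda patch: sum(patch))

def describe(patch):
    # Stub for step 2 (HOG/SIFT would go here).
    return tuple(patch)

database = {}     # descriptor -> object id; stands in for the inverted file
next_id = 0

def process_frame(frame):
    """One iteration of the online loop: find, describe, look up or learn."""
    global next_id
    patch = most_salient_region(frame)   # 1. saliency finds a region
    desc = describe(patch)               # 2. compute its descriptor
    if desc in database:                 # 3. lookup: known object
        return ("recognized", database[desc])
    database[desc] = next_id             # 4. else: remember it on the fly
    next_id += 1
    return ("learned", database[desc])

frame = [[0, 1], [9, 9], [2, 2]]   # toy "frame": a list of patches
print(process_frame(frame))   # first sighting -> learned
print(process_frame(frame))   # second sighting -> recognized
```

Step 5 is just calling `process_frame` on each new frame; the interesting engineering is entirely inside the stubbed lookup.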

(Oct 19 '11 at 15:46) mugetsu

The demo was "upload a picture, we find similar ones in seconds on consumer hardware". So it's more or less what you described, minus the tracking.

I think building the inverted file system is offline, but I'm not sure. I have actually no idea about image retrieval ;)

Oh, there is an obvious demo to what you want: google goggles. Though that probably doesn't run on consumer hardware ;)

About the features: I would definitely use bag-of-SIFT features. So you find interest points, describe them with SIFT, find cluster assignments with a codebook, build a histogram, and then probably compress or hash that (again, not my field of work, but there are papers about that out there).
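The codebook-assignment step described above can be prototyped in plain Python: hard-assign each descriptor to its nearest codebook center and accumulate a normalized histogram. The 2-D codebook below is a toy stand-in for k-means centroids computed over real 128-D SIFT space:

```python
import math

# Toy codebook of 3 "visual words" (in practice: k-means centroids of SIFT descriptors).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

def nearest_word(desc):
    # Hard-assign a descriptor to its closest codebook center.
    return min(range(len(codebook)),
               key=lambda i: math.dist(desc, codebook[i]))

def bag_of_features(descriptors):
    # Histogram of visual-word counts, L1-normalized so the number
    # of interest points per image doesn't matter.
    hist = [0.0] * len(codebook)
    for d in descriptors:
        hist[nearest_word(d)] += 1.0
    total = sum(hist) or 1.0
    return [h / total for h in hist]

descs = [(0.1, 0.1), (0.9, 0.1), (0.1, 0.9), (0.2, 0.8)]
print(bag_of_features(descs))   # -> [0.25, 0.25, 0.5]
```

The resulting fixed-length histogram is what gets compressed, hashed, or fed into the inverted file; the raw descriptors can be discarded after assignment.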

Doing that is probably not possible at any real frame rate, but it should be possible on the order of a second, I think.

You might need to invest in a CUDA capable GPU...

Again, I am not an expert in the field, but I think this is probably quite a big project...

(Oct 19 '11 at 16:40) Andreas Mueller

I've seen many good solutions that can't be done in real time, which leads me to wonder whether real time is really necessary. What I want to do is recognize all the objects in a scene and learn interesting new objects, all done one object at a time. If it only takes 1 second to do feature recognition with SIFT, is that such a big deal in my case?

If SIFT can still do what I need, I'm thinking of the following setup:

1. Use the iLab toolkit to find the most salient object

2. Use VLFeat SIFT to get a feature descriptor for the salient object

3. Object recognition: SIFT feature matching

4. No match: the feature descriptor is saved to a database

Is this ok?

I realize that SIFT classifiers require quite a few object descriptors per class, and that there is no way I could train a class that way in real time (or close to real time). What I could do is aggregate a class definition over time. So when I see a cup for the first time, I save its descriptor and add it to the cup class. The next time I see an object of the cup class, I add the new descriptor (obtained in real time) to the database. Once I have enough descriptors in the database, I will allow the cup class to be recognized; until then I will only recognize specific matches against objects I've already seen.
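The aggregate-over-time scheme could look something like this toy sketch: store descriptors per class as they arrive, and only allow class-level recognition once a class has accumulated enough examples (the class names, structure, and threshold are all made up for illustration):

```python
from collections import defaultdict

MIN_EXAMPLES = 3   # a class becomes recognizable once it has this many descriptors

class IncrementalDB:
    """Grow class definitions over time, as described above (toy sketch)."""

    def __init__(self):
        self.examples = defaultdict(list)   # class label -> stored descriptors

    def add(self, label, descriptor):
        # Every sighting contributes one more descriptor to its class.
        self.examples[label].append(descriptor)

    def recognizable(self, label):
        # Only classes with enough stored descriptors may be matched as a class;
        # below the threshold, only exact instance matches would be reported.
        return len(self.examples[label]) >= MIN_EXAMPLES

db = IncrementalDB()
for d in [(1, 2), (1, 3), (2, 2)]:
    db.add("cup", d)          # three cup sightings over time
db.add("phone", (9, 9))       # only one phone sighting so far

print(db.recognizable("cup"))    # -> True
print(db.recognizable("phone"))  # -> False
```

One open design question this hides: deciding that a new sighting belongs to an existing class is itself a matching problem, so the scheme bootstraps off instance-level matches until the class is mature.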

(Oct 20 '11 at 15:38) mugetsu

Yeah, that sounds reasonable. If you save features separately, you'll get too many points, though. Imagine having 100 points per object and 1000 objects and trying to do nearest neighbors. I think that will be too slow, but I'm not really sure. So maybe try matching the descriptors first, and if that's too slow, use vector quantization or histograms. Have you looked at the "video google" paper yet?

(Oct 20 '11 at 17:02) Andreas Mueller

About the speed of SIFT: I think vlfeat is pretty much the best code for that; I'm not totally sure of the runtime, but I think it's more like 0.3 sec than 0.01 sec. So that's not really real-time, but it may be good enough.

(Oct 20 '11 at 17:03) Andreas Mueller

No, I haven't taken a look at the video google paper yet. Embarrassingly, I can't seem to find it on the page you directed me to... As for the nearest neighbor, I think I can optimize that with the Best Bin First algorithm for returning nearest neighbors quickly. I will try that out, thanks!
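Best Bin First is easy to prototype: search a kd-tree, but keep the backtracking bins in a priority queue ordered by distance to the splitting plane, and stop after a fixed number of node visits, trading exactness for speed. A toy 2-D sketch (real SIFT descriptors are 128-D, and OpenCV's FLANN matcher provides a tuned implementation of this kind of approximate search):

```python
import heapq

def build_kdtree(points, depth=0):
    # Simple 2-D kd-tree; split alternately on x and y.
    if not points:
        return None
    axis = depth % 2
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid], axis,
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

def bbf_nearest(tree, query, max_checks=5):
    """Best-bin-first search: visit bins closest to the query first,
    stopping after max_checks node visits (approximate answer)."""
    best, best_d2 = None, float("inf")
    heap = [(0.0, 0, tree)]   # (distance to splitting plane, tiebreak, node)
    tiebreak, checks = 1, 0
    while heap and checks < max_checks:
        _, _, node = heapq.heappop(heap)
        if node is None:
            continue
        point, axis, left, right = node
        checks += 1
        d2 = sum((q - p) ** 2 for q, p in zip(query, point))
        if d2 < best_d2:
            best, best_d2 = point, d2
        diff = query[axis] - point[axis]
        # The near child is searched at priority 0; the far child is queued
        # by its squared distance to the splitting plane (the "bin" distance).
        near, far = (left, right) if diff <= 0 else (right, left)
        heapq.heappush(heap, (0.0, tiebreak, near)); tiebreak += 1
        heapq.heappush(heap, (diff * diff, tiebreak, far)); tiebreak += 1
    return best

pts = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(pts)
print(bbf_nearest(tree, (8.5, 1.5)))   # -> (8, 1)
```

With `max_checks` capped, the search can miss the true nearest neighbor in adversarial cases, which is exactly the approximation BBF accepts to stay fast in high dimensions.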

(Oct 20 '11 at 17:33) mugetsu

Oh sorry, the video google paper was not on any of the pages, you can find it via google scholar: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.62.1951&rep=rep1&type=ps

(Oct 21 '11 at 05:04) Andreas Mueller
