If I have an image of a table, then it stands to reason that the table is not beside a dolphin. How do I incorporate such knowledge into my object recognition model? I am particularly interested in large-scale tasks, where there might be 100K or millions of objects.
|
In general I'd say there are two distinct ways you can go about this: (1) using language-model-like things to model images, and (2) linking a language model to things like image captions and using that in a larger system. The first paper I know of that did (2) in an interesting way was by Kobus Barnard, Nando de Freitas, Forsyth, and Dave Blei, Matching words and pictures. From Fei-Fei Li's lab there are lots of examples of both (1) and (2). I like Spatially coherent latent topic model for concurrent object segmentation and classification; What, Where and Who? Telling the Story of an Image by Activity Classification, Scene Recognition and Object Categorization; and A Bayesian Hierarchical Model for Learning Natural Scene Categories. Of course, if by language model you mean n-grams, then I think you have a structure mismatch, but even then some similar ideas might apply (if you squint a bit, I think you can say that grid MRFs as used in vision are kind of n-gram-ish).
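To make the n-gram-ish point concrete, here is a toy sketch (the annotations, labels, and weights are made up) of one simple way to use that kind of co-occurrence knowledge: estimate pairwise label co-occurrence statistics from annotated images, then use them as a pairwise term to re-score raw detector outputs, a bit like a fully connected MRF over the labels present in one image:

```python
import numpy as np
from collections import Counter
from itertools import combinations

# Hypothetical training annotations: each image comes with a set of tag words.
annotations = [{"table", "chair", "lamp"},
               {"dolphin", "ocean", "boat"},
               {"table", "plate", "lamp"}]
labels = sorted(set().union(*annotations))

# Pairwise co-occurrence counts play the role of a "bigram" model over labels.
pair_counts = Counter()
for tags in annotations:
    for a, b in combinations(sorted(tags), 2):
        pair_counts[(a, b)] += 1

def pair_log_prob(a, b, alpha=0.1):
    """Smoothed log-probability that labels a and b appear in the same image."""
    a, b = sorted((a, b))
    return np.log((pair_counts[(a, b)] + alpha) /
                  (len(annotations) + alpha * len(labels) ** 2))

def rescore(detections, weight=0.5):
    """detections: label -> detector log-score.  Add a pairwise co-occurrence
    term over all detected labels (a fully connected 'MRF' on the label set)."""
    return {a: s + weight * sum(pair_log_prob(a, b) for b in detections if b != a)
            for a, s in detections.items()}

# "table" and "dolphin" were never seen together, so "dolphin" gets pushed down.
print(rescore({"table": -0.1, "dolphin": -0.2, "lamp": -0.3}))
```

At the scale you mention, the co-occurrence table would have to be kept sparse, but the basic idea is the same.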
|
Alexandre's suggestions are probably the most prominent in the field and are a good starting point. There is another piece of work that I quite like, by Du, L., Ren, L., Dunson, D., and Carin, L.: A Bayesian model for simultaneous image clustering, annotation and object segmentation, and its follow-up, the Logistic Stick-Breaking Process. This is a hierarchical Bayesian nonparametric method for joint segmentation and annotation. I would guess that it does not scale to 100K images, but the approach is quite interesting. If you are looking at that many images, it's hard to say what will actually work; I think most graphical-model-based methods will break down at this scale. Do you want to find bounding boxes using annotations? There is current work in the ImageNet challenge and work by Torralba on the SUN dataset in this direction. But at this scale, "just" doing bounding-box-based recognition is quite challenging and new methods have to be developed.
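For intuition about the stick-breaking machinery behind that follow-up paper, here is a toy sketch (not the authors' model; the shapes, parameters, and truncation level are made up) of how a truncated, covariate-dependent logistic stick-breaking construction turns per-segment features into mixture weights:

```python
import numpy as np

def logistic_stick_breaking_weights(X, W, b):
    """Covariate-dependent mixture weights via truncated logistic stick-breaking:
    pi_k(x) = sigma(f_k(x)) * prod_{j<k} (1 - sigma(f_j(x))).
    X: (n, d) features (e.g. per-superpixel), W: (K, d), b: (K,)."""
    v = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))   # sigma(f_k(x)), one stick per component
    # fraction of the stick still unbroken before component k
    remaining = np.cumprod(np.hstack([np.ones((X.shape[0], 1)), 1.0 - v[:, :-1]]), axis=1)
    pi = v * remaining
    # leftover mass goes to a final catch-all component so each row sums to 1
    return np.hstack([pi, 1.0 - pi.sum(axis=1, keepdims=True)])

# toy usage: 5 superpixels with 3-d features, 4 stick components (+1 leftover)
rng = np.random.default_rng(0)
pi = logistic_stick_breaking_weights(rng.normal(size=(5, 3)),
                                     rng.normal(size=(4, 3)),
                                     rng.normal(size=4))
print(pi.sum(axis=1))   # each row sums to 1
```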
|
You could look at generative models (e.g., RBMs, deep belief nets, etc.). These are models built so that a sample drawn from them has a high probability of looking like the images in the training set. They use latent variables that act as feature detectors, and the activities of those feature detectors can in turn be used to learn deeper structure in the original data. Repeated many times over, this process can learn powerful models that can then be used to identify objects in a scene, but it can be computationally expensive. The largest network I recall reading about had 10,000 inputs (e.g., a 100x100 pixel B&W image) and required a decent amount of time to train on a fairly powerful nVidia GPU card. Then again, there have been some recent advances in training DBNs with 2nd-order optimization techniques that significantly improve training times and performance. -Brian
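For a feel of what one such feature-detecting layer looks like, here is a minimal RBM sketch trained with one step of contrastive divergence (CD-1) on toy binary data; the sizes, learning rate, and epoch count are arbitrary, and a real system would stack several such layers into a DBN:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Toy binary "images": 6-pixel vectors; a real net might use 100x100 = 10,000 inputs.
V = rng.integers(0, 2, size=(100, 6)).astype(float)

n_vis, n_hid = V.shape[1], 4
W = 0.01 * rng.normal(size=(n_vis, n_hid))
b_vis, b_hid = np.zeros(n_vis), np.zeros(n_hid)
lr = 0.1

for epoch in range(50):
    # positive phase: hidden feature-detector probabilities given the data
    p_h = sigmoid(V @ W + b_hid)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    # negative phase: one Gibbs step back down and up again (the CD-1 "reconstruction")
    p_v_recon = sigmoid(h @ W.T + b_vis)
    p_h_recon = sigmoid(p_v_recon @ W + b_hid)
    # contrastive-divergence parameter updates
    W += lr * (V.T @ p_h - p_v_recon.T @ p_h_recon) / len(V)
    b_vis += lr * (V - p_v_recon).mean(axis=0)
    b_hid += lr * (p_h - p_h_recon).mean(axis=0)

# the hidden activations can now serve as input features for the next layer
features = sigmoid(V @ W + b_hid)
print(features.shape)
```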