|
I'm looking at the following scenario -- I have a number of characters which have both synthetic ground truth (fonts) and natural scene images (SVHN). Other characters only have synthetic ground truth. I want my character classifier to recognize all characters in natural scene images. Does anyone know of someone doing something similar? I found some related work by LeCun and Hinton on learning invariant mappings, but none that applied this idea to this problem.
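To make the "invariant mapping" idea concrete, here is a rough sketch of the kind of loss I have in mind (a DrLIM-style contrastive loss; the embeddings would come from whatever feature extractor is being trained, so this is illustrative only, not code from any of those papers):

    import numpy as np

    def contrastive_loss(f_synth, f_natural, same_char, margin=1.0):
        # f_synth, f_natural: (N, D) embeddings of paired synthetic/natural crops
        # same_char: (N,) boolean array, True when the pair shows the same character
        d = np.linalg.norm(f_synth - f_natural, axis=1)          # pairwise distances
        pull = same_char * d ** 2                                # attract matching pairs
        push = (~same_char) * np.maximum(0.0, margin - d) ** 2   # repel mismatched pairs
        return 0.5 * np.mean(pull + push)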
|
The first thing that came to mind was tangent propagation, since you can prove that using it is very similar to training on a large space of transformations of your input. On that note, I read Bengio's Manifold Tangent Classifier paper, where they learn the tangent directions from data and so build a classifier that is free of domain knowledge.
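A sketch of that equivalence, assuming a parametrised transformation T(x, alpha) with T(x, 0) = x (e.g. a small rotation): to first order, asking the classifier f to be stable under small transformations of its input is the same as penalising the directional derivative of f along the transformation's tangent, which is exactly the tangent propagation penalty.

    % First-order expansion of a small transformation:
    T(x, \alpha) \approx x + \alpha\, t(x), \qquad t(x) = \left.\frac{\partial T(x,\alpha)}{\partial \alpha}\right|_{\alpha=0}
    % so requiring f(T(x,\alpha)) \approx f(x) for small \alpha yields the tangent-prop penalty
    E_{\mathrm{tp}} = \lambda \sum_i \bigl\| J_f(x_i)\, t(x_i) \bigr\|^2, \qquad J_f(x_i) = \left.\frac{\partial f}{\partial x}\right|_{x=x_i}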
|
Your question is really one of feature representation. If you can learn a "big" enough representation, then your problem becomes a non-problem. The representation only needs to be learned once and can then be applied to most areas of computer vision -- it would be a human-competitive recognition system. Until recently the machine learning community had failed to create a method for learning such a large representation and had been too narrowly focused on problem-specific learning (while wasting computational resources discovering and rediscovering feature representations).

A few days ago a breakthrough was made in large-scale feature learning -- the encoder graph -- which allows one to learn, once and for all, a feature representation that applies to most of computer vision. This only needs to be learned once, but that one time requires large resources: approximately 1 million cores for a few days and a corpus of 1 to 10 billion natural images. After the "featurizer" has been learned, using it is just a matter of giving it images, getting featurized data back, and applying supervised classification techniques such as random forests or logistic regression. So, really, your problem is a non-problem if the computer vision field makes a one-time investment.
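Granting that premise, the downstream workflow would look roughly like this sketch; `featurize` stands in for the hypothetical pretrained featurizer (it is not a real library call), and the data is dummy data just so the sketch runs end to end:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def featurize(images):
        # Placeholder for the hypothetical pretrained featurizer: it would map
        # raw images to fixed-length feature vectors. Here we just flatten pixels.
        return images.reshape(len(images), -1)

    images_train = np.random.rand(200, 32, 32)           # dummy character crops
    labels_train = np.random.randint(0, 10, size=200)    # dummy character labels

    clf = LogisticRegression(max_iter=1000)
    clf.fit(featurize(images_train), labels_train)

    images_test = np.random.rand(20, 32, 32)
    predictions = clf.predict(featurize(images_test))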
|
Make a huge combined data set.
This is similar to how LeCun runs many distortions over MNIST to improve accuracy. If you want to get even more creative, you can apply a curriculum strategy and start with the easiest images first (no background, then light background, then full background, then filters); a rough sketch of that appears below. But I think just making a huge combined set would be good enough.

That's very similar to what I started with, and the results are still below human accuracy. I think perhaps my synthetic examples are not capturing the range of variation present in photos. http://yaroslavvb.com/upload/save/js1.png Serge Belongie pointed out a relevant reference -- "Separating Style and Content with Bilinear Models"
(Aug 27 '12 at 01:04)
Yaroslav Bulatov
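Here is the rough sketch of the curriculum idea from the answer above; the stage construction and the commented-out training call are placeholders, not an existing pipeline:

    import numpy as np

    def make_stage(n, difficulty):
        # Dummy stand-in for one difficulty level of labelled character images
        # (clean synthetic -> light background -> full background -> SVHN crops).
        X = np.random.rand(n, 32 * 32) * difficulty
        y = np.random.randint(0, 10, size=n)
        return X, y

    stages = [make_stage(200, d) for d in (0.1, 0.3, 0.6, 1.0)]   # easy -> hard

    X_parts, y_parts = [], []
    for X, y in stages:
        # Curriculum: grow the combined training set one difficulty level at a
        # time and (re)train on everything seen so far.
        X_parts.append(X)
        y_parts.append(y)
        X_comb, y_comb = np.concatenate(X_parts), np.concatenate(y_parts)
        # train_step(model, X_comb, y_comb)   # placeholder for the actual classifier update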
|
Jürgen Schmidhuber does work where he creates more data (e.g. by applying distortions to images) and trains neural networks on it, so he may have done work that inserts synthetic data into natural images and trains on that. He has beaten Yann LeCun in competitions.
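A very rough sketch of that kind of insertion, pasting a rendered glyph onto a natural image crop with PIL (the font path and the background image here are assumptions, not from any published pipeline):

    from PIL import Image, ImageDraw, ImageFont

    def paste_char_on_background(char, background, size=32):
        # Render `char` as a mask and blend an "ink" colour onto a natural crop.
        bg = background.convert("RGB").resize((size, size))
        glyph = Image.new("L", (size, size), 0)
        draw = ImageDraw.Draw(glyph)
        font = ImageFont.truetype("DejaVuSans-Bold.ttf", int(size * 0.8))  # assumed font file
        draw.text((4, 2), char, fill=255, font=font)
        ink = Image.new("RGB", (size, size), (250, 240, 230))
        return Image.composite(ink, bg, glyph)                 # glyph acts as an alpha mask

    # background = Image.open("scene_crop.png")                # any natural image crop
    # sample = paste_char_on_background("7", background)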
http://www.idsia.ch/~juergen/
Have you tried doing what the Kinect people did and using the synthetic ground truth to generate zillions of training examples? You could, for example, write the character on a texture, use a renderer to put this texture on a surface, render an image from it, and then corrupt it with some realistic-ish noise. By weighting the synthetic data and the real data appropriately you should be able to do pretty well, I hope.
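A minimal sketch of that corruption step, warping a rendered glyph with a random perspective transform and adding noise (PIL + numpy; the weighting comment at the end assumes a scikit-learn classifier that accepts sample_weight, and the specific weights are just an illustration):

    import numpy as np
    from PIL import Image

    def corrupt(glyph_img, rng, size=32):
        # Random mild perspective warp (8 coefficients for PIL's transform)...
        coeffs = [1 + rng.uniform(-0.1, 0.1), rng.uniform(-0.1, 0.1), rng.uniform(-2, 2),
                  rng.uniform(-0.1, 0.1), 1 + rng.uniform(-0.1, 0.1), rng.uniform(-2, 2),
                  rng.uniform(-0.002, 0.002), rng.uniform(-0.002, 0.002)]
        warped = glyph_img.convert("L").resize((size, size)).transform(
            (size, size), Image.PERSPECTIVE, coeffs, resample=Image.BILINEAR)
        # ...plus additive Gaussian noise as a stand-in for "realistic-ish" corruption.
        arr = np.asarray(warped, dtype=np.float32) / 255.0
        return np.clip(arr + rng.normal(0.0, 0.1, arr.shape), 0.0, 1.0)

    # rng = np.random.default_rng(0)
    # noisy = corrupt(some_rendered_glyph, rng)
    # Weighting the two data sources when training, e.g. with scikit-learn:
    #   w = np.where(is_real, 1.0, 0.2)      # down-weight the zillions of synthetic examples
    #   clf.fit(X, y, sample_weight=w)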