I'm looking at the following scenario: for some characters I have both synthetic ground truth (fonts) and natural scene images (SVHN); for other characters I have only synthetic ground truth. I want my character classifier to recognize all of the characters in natural scene images. Does anyone know of work along these lines?

I found some related work by LeCun and Hinton on learning invariant mappings, but none that applies this idea to this problem.
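For concreteness, here is a minimal sketch of the kind of invariant mapping I mean (the contrastive loss from Hadsell, Chopra & LeCun's "Dimensionality Reduction by Learning an Invariant Mapping"); applying it across the synthetic/natural domain gap is my own assumption, not something from that paper:

```python
import torch

# Pull embeddings of the same character (synthetic vs. natural) together;
# push embeddings of different characters at least `margin` apart.
# z1, z2: (batch, dim) embeddings; same_char: float tensor of 0/1 labels.
def contrastive_loss(z1, z2, same_char, margin=1.0):
    d = torch.norm(z1 - z2, dim=1)
    pos = same_char * d.pow(2)
    neg = (1 - same_char) * torch.clamp(margin - d, min=0).pow(2)
    return 0.5 * (pos + neg).mean()
```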

asked Aug 22 '12 at 18:23


Yaroslav Bulatov

edited Aug 22 '12 at 20:01

Jürgen Schmidhuber does work where he creates more training data (e.g. by applying distortions to images) and trains neural networks on it, so he may have done work where synthetic data is inserted into natural images and trained on. He has beaten Yann LeCun in competitions.

http://www.idsia.ch/~juergen/

(Aug 22 '12 at 19:18) marshallp

Have you tried doing what the Kinect people did and using the synthetic ground truth to generate zillions of training examples? You could, for example, write the character on a texture, use a renderer to put that texture on a surface, render an image from it, and then corrupt it with some realistic-ish noise. By weighting this synthetic data and the real data appropriately you should be able to do pretty well, I hope.
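A minimal sketch of the idea (not the Kinect pipeline itself): render a glyph, warp it as a cheap stand-in for full 3D rendering, and add noise. The font path and all parameter values are assumptions for illustration:

```python
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def synth_example(char, size=32, seed=0):
    rng = np.random.default_rng(seed)
    img = Image.new("L", (size, size), color=255)
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype("DejaVuSans.ttf", size - 8)  # assumed font file
    draw.text((6, 2), char, fill=0, font=font)
    # Small random affine warp as a stand-in for rendering onto a surface.
    a, b = 1 + 0.1 * rng.standard_normal(), 0.2 * rng.standard_normal()
    img = img.transform((size, size), Image.AFFINE, (a, b, 0, b, a, 0))
    # Additive Gaussian noise as the "realistic-ish" corruption.
    arr = np.asarray(img, dtype=np.float32) + 10 * rng.standard_normal((size, size))
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
```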

(Aug 28 '12 at 00:00) Alexandre Passos ♦

3 Answers:

The first thing that came to mind was tangent propagation, since one can prove that using it is very similar to training on a large space of transformations of your input.

On that note, Bengio's Manifold Tangent Classifier paper is also relevant: there the tangent directions are learned from the data itself, giving a classifier free of hand-coded domain knowledge.
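A minimal sketch of the tangent propagation idea, assuming a differentiable PyTorch model; the finite difference stands in for the analytic directional derivative used in the original formulation:

```python
import torch

# Penalize the change in model output along a tangent direction of a
# known transformation (e.g. a small rotation of the input image).
def tangent_prop_penalty(model, x, tangent, eps=1e-3):
    out = model(x)
    out_shifted = model(x + eps * tangent)
    # Finite-difference approximation of the directional derivative.
    return ((out_shifted - out) / eps).pow(2).mean()

# Usage: total_loss = task_loss + lam * tangent_prop_penalty(net, x, t)
```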

answered Aug 23 '12 at 04:59


Leon Palafox ♦


Your question is really one of feature representation. If you can learn a "big" enough representation, then your problem becomes a non-problem. The representation only needs to be learned once and can then be applied to most areas of computer vision - it would be a human-competitive recognition system.

Until recently the machine learning community had failed to create a method for learning such a large representation and had been too narrowly focused on problem-specific learning (while wasting computational resources discovering and rediscovering feature representations).

A few days ago a breakthrough was made in large-scale feature learning - the encoder graph - which allows one to learn, once and for all, a feature representation that applies to most of computer vision. This only needs to be learned once, but requires large resources that one time: approximately 1 million cores for a few days and a corpus of 1 to 10 billion natural images.

After the "featurizer" has been learned, using it is just a matter of feeding it images, getting featurized data back, and applying supervised classification techniques such as random forests or logistic regression.
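A toy sketch of that last step, with a stand-in featurizer and synthetic data; the real featurizer, its interface, and the data are all assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in "featurizer": any fixed, pre-learned image -> feature mapping.
def featurize(images):
    return images.reshape(len(images), -1)  # placeholder: raw pixels

rng = np.random.default_rng(0)
images = rng.random((200, 8, 8))          # toy stand-in images
labels = rng.integers(0, 10, size=200)    # toy character labels

clf = LogisticRegression(max_iter=1000).fit(featurize(images), labels)
print(clf.predict(featurize(images[:5])))
```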

So, really, your problem is a non-problem if the computer vision field makes a one-time investment.

answered Aug 24 '12 at 14:30


marshallp

Make a huge combined data set.

  1. Find patches of the natural images that don't have any characters in them. I assume it wouldn't be too hard to train a classifier to do this with 99% accuracy; even if there's a little noise, it's fine.

  2. Make a combined data set: an image with a synthetic numeral layered over the natural image patch. You can use no natural patch, a 100% natural patch (as the bottom layer), a 50% blend, etc. (see the sketch after this list).

  3. Optional: Apply a filter over the combined image, e.g. some jitter or blur.

  4. Train. (Profit?)
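A minimal sketch of steps 2-3, assuming you already have character-free natural patches; the font path and parameter values are assumptions:

```python
from PIL import Image, ImageDraw, ImageFilter, ImageFont

# Blend a character-free natural patch under a rendered numeral, then
# optionally blur (step 3). bg_alpha=0 gives a clean synthetic image;
# bg_alpha=1 uses the full natural patch as the bottom layer.
def combine(char, patch, bg_alpha=0.5, blur_radius=1.0):
    bg = Image.blend(Image.new("L", patch.size, 255),
                     patch.convert("L"), bg_alpha)
    draw = ImageDraw.Draw(bg)
    font = ImageFont.truetype("DejaVuSans.ttf", 24)  # assumed font file
    draw.text((4, 2), char, fill=0, font=font)
    return bg.filter(ImageFilter.GaussianBlur(blur_radius))
```

Varying bg_alpha and blur_radius also gives a natural difficulty knob for the curriculum strategy mentioned below.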

This is similar to how LeCun applies many distortions to MNIST to improve accuracy.

If you want to get even more creative, you can apply a curriculum strategy and start with the easiest images first (no background, then light background, then full background, then filters). But I think just making a huge combined set would be good enough.

answered Aug 24 '12 at 15:29


Joseph Turian ♦♦

That's very similar to what I started with, and the results are still below human accuracy. I think perhaps my synthetic examples are not capturing the range of variation present in photos. http://yaroslavvb.com/upload/save/js1.png Serge Belongie pointed out a relevant reference: "Separating Style and Content with Bilinear Models".

(Aug 27 '12 at 01:04) Yaroslav Bulatov

