When implementing Optical Character Recognition (OCR), specifically when creating a training set, I don't see how to link an image of a text line with its ground-truth text.

Assume the following image,

[image: a scanned text line reading "Hello World"]

If I were to link it to its ground-truth text, "Hello World", an overlapping sliding window would move across the image from left to right, creating a feature vector at each position. The confusing part is the "overlapping": how does a window know which character in the ground truth it belongs to?

In some papers, the authors concatenate the feature vectors extracted from the overlapping sliding windows and map them to the ground-truth sentence. Since training is done at the line level, how would such a classifier predict a different sentence, for example "world hello"?
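For concreteness, here is a minimal sketch (in Python with NumPy, assuming the line image is a 2-D grayscale array) of how overlapping sliding windows are typically extracted; the window width, stride, and the use of raw pixels as the feature vector are illustrative assumptions, not a prescribed method.

```python
import numpy as np

def sliding_window_features(line_image, win_width=20, stride=5):
    """Extract overlapping vertical windows from a text-line image.

    line_image: 2-D array (height x width), e.g. a grayscale scan.
    Returns an array of shape (num_windows, height * win_width),
    one flattened feature vector per window position.
    Window width and stride are arbitrary example values.
    """
    width = line_image.shape[1]
    features = []
    for x in range(0, width - win_width + 1, stride):
        window = line_image[:, x:x + win_width]   # overlapping slice
        features.append(window.flatten())         # raw pixels as the feature vector
    return np.asarray(features)

# Example: a dummy 32x200 "line image"
line = np.random.rand(32, 200)
feats = sliding_window_features(line)
print(feats.shape)  # (37, 640): 37 overlapping windows, 640 features each
```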

Thank you in advance...

asked Dec 19 '13 at 05:20

Issam Laradji

edited Dec 19 '13 at 05:21


3 Answers:

Hello Issam,

To recognize the image you've provided, a classifier of any type (a neural network, for example) could be trained to recognize the spaces between the characters. For example, there is a space between the characters 'H' and 'e' in the picture you've provided above.

Overlapping sliding windows could be passed over your image, feeding the small patch each one captures to a neural network that recognizes where there is a space between characters. Once this problem is solved, it comes down to having another classifier (maybe even another neural network!) recognize the individual handwritten characters.

For example, let's say you pass the sliding windows over the "Hello World" image and have your NN figure out the spaces. Theoretically, the image would be broken down as follows:

[image: the "Hello World" line with red vertical lines marking the detected gaps between characters]

Each red line represents a position where your NN detected a space. Since the characters are all separated, it's then a matter of simple character recognition and combining the results to form "Hello World".

Notice, however, that the space between the 'o' and 'r' in 'World' might not be detected because the letters are so close together. In that case, the sliding windows could be refined, or a stronger classifier used.
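As a rough illustration of this two-stage idea, here is a hedged Python sketch; the `is_space` detector and the `char_classifier` at the end are hypothetical placeholders (not code from this answer), and the window width and stride are arbitrary.

```python
def split_line_at_spaces(line_image, is_space, win_width=20, stride=5):
    """Split a text-line image into character segments using a space detector.

    is_space(window) is a hypothetical classifier that returns True when the
    given (height x win_width) patch lies on a gap between two characters.
    Returns a list of (left, right) column ranges, one per detected character.
    """
    width = line_image.shape[1]

    # Stage 1: mark the centre of every window the detector flags as a gap.
    cuts = []
    for x in range(0, width - win_width + 1, stride):
        if is_space(line_image[:, x:x + win_width]):
            cuts.append(x + win_width // 2)

    # Stage 2: turn consecutive cut positions into character segments.
    boundaries = [0] + cuts + [width]
    segments = [(l, r) for l, r in zip(boundaries[:-1], boundaries[1:]) if r - l > 1]
    return segments

# Each segment can then be fed to a separate character classifier, e.g.
# text = "".join(char_classifier(line_image[:, l:r]) for l, r in segments)
```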

Hope this helps, as this was my first post :)

answered Dec 30 '13 at 08:14

Alejandro Lucena

edited Dec 30 '13 at 13:30

The standard approach is to create character-level ground truth first. I.e., use some heuristic to break the line up into candidate characters, then make a guess as to which segment corresponds to each character; that will be your first training set. Use that set to train a classifier that predicts the character in a given patch. This classifier will give you a good guess of which character is contained in each patch, so you can use it to go over your original candidates and make better segment-letter assignments. This gives you your second training set. Repeat several times until the process converges. This is very similar to Expectation-Maximization training. Once you have a good character-level classifier, you can recognize a whole line using the same heuristic process.
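A hedged, pseudocode-style sketch of this bootstrapping loop in Python; the segmentation heuristic, the alignment step, and the classifier trainer are placeholders I'm assuming for illustration, not code from this answer.

```python
def bootstrap_character_labels(lines, transcripts, segment_heuristic,
                               align, train_classifier, num_rounds=5):
    """EM-like bootstrapping of character-level labels from line-level labels.

    lines:        list of text-line images
    transcripts:  corresponding ground-truth strings (line level only)
    segment_heuristic(line) -> candidate character patches       (hypothetical)
    align(patches, text, classifier) -> list of (patch, char)    (hypothetical)
    train_classifier(pairs) -> patch-to-character classifier     (hypothetical)
    """
    classifier = None
    for _ in range(num_rounds):
        training_pairs = []
        for line, text in zip(lines, transcripts):
            patches = segment_heuristic(line)
            # Round 1: a rough guess; later rounds use the current classifier's
            # predictions to make better segment-letter assignments.
            training_pairs.extend(align(patches, text, classifier))
        classifier = train_classifier(training_pairs)
    return classifier
```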

This is what's done in "Gradient-Based Learning Applied to Document Recognition" by LeCun et al.: http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf

An alternative approach is to make "Hello World" an actual label. If you limit your output to size N, you can have a neural network with N output layers, and for this example you tell it to produce "H" in the first layer, "e" in the second, "l" in the third, and so on. The remaining N-11 layers can be set to some character representing a space.
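A small sketch of how such a fixed-length target could be encoded; the padding symbol, the alphabet, and N=16 are arbitrary assumptions for illustration.

```python
def encode_fixed_length(text, n=16, alphabet="abcdefghijklmnopqrstuvwxyz", pad="_"):
    """Encode a line as exactly n character targets, padding with a filler symbol.

    Each of the n positions would be predicted by its own output layer
    (e.g. its own softmax), as described above.
    """
    symbols = [pad] + list(alphabet) + [" "]
    padded = list(text.lower()[:n]) + [pad] * max(0, n - len(text))
    return [symbols.index(c) for c in padded]

print(encode_fixed_length("Hello World"))
# 11 real character targets followed by 16 - 11 = 5 padding targets
```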

We've had some success using this approach for house number recognition, where N=5, but it's untested and probably needs some tricks to get it to work on text -- http://arxiv.org/abs/1312.6082

answered Dec 30 '13 at 20:25

Yaroslav Bulatov2

You might want to check out Alex Graves' work on (multi-dimensional) recurrent networks, which do the segmentation (i.e. which part of the image corresponds to which character) implicitly. Finding the right chunking of pixels into characters is thus learned.

This is not only convenient, but it also worked best in several competitions, IIRC.
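Modern frameworks expose this implicit-segmentation idea through the CTC loss that Graves introduced. A minimal hedged sketch with PyTorch follows; the network, sizes, and dummy data are assumptions for illustration, not the setup used in the competitions mentioned above.

```python
import torch
import torch.nn as nn

# Dummy setup: T sliding-window time steps per line, C character classes
# (index 0 reserved for the CTC blank), batch of N lines, targets of length S.
T, N, C, S = 40, 2, 28, 11            # S = 11 characters in "hello world"

rnn = nn.LSTM(input_size=64, hidden_size=128, bidirectional=True)
proj = nn.Linear(2 * 128, C)
ctc = nn.CTCLoss(blank=0)

features = torch.randn(T, N, 64)      # per-window feature vectors for each line
logits = proj(rnn(features)[0])       # shape (T, N, C)
log_probs = logits.log_softmax(dim=2)

targets = torch.randint(1, C, (N, S))          # character indices, no segmentation given
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()   # alignment of windows to characters is learned implicitly
```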

answered Dec 31 '13 at 17:22

Justin Bayer
