When implementing Optical Character Recognition (OCR), I don't see how, in creating a training set, to link an image of a text line with its ground truth. Assume the following image:
If I were to link it to its ground-truth text, "Hello World", an overlapping sliding window moves across the image from left to right, creating a feature vector at each position. The confusing part is the "overlapping": how does the window know which character in the ground truth it belongs to? In some papers, the authors concatenate the feature vectors extracted from the overlapping sliding window and map them to the ground-truth sentence. Since training is done at the line level, how would a classifier predict a different sentence, for example "world hello"? Thank you in advance...
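To make the sliding-window idea concrete, here is a minimal sketch of extracting one feature vector per window position from a text-line image. The window width and stride values are illustrative, not from any particular paper; real systems tune them.

```python
import numpy as np

def sliding_windows(line_image, width=20, stride=5):
    """Yield overlapping vertical strips of a text-line image,
    one flattened feature vector per window position."""
    h, w = line_image.shape
    for x in range(0, w - width + 1, stride):
        patch = line_image[:, x:x + width]
        yield patch.flatten()

# toy grayscale "line image": 30 px tall, 100 px wide
img = np.zeros((30, 100))
features = list(sliding_windows(img))
print(len(features))      # 17 window positions
print(features[0].shape)  # (600,) -- 30 * 20 pixels per window
```

Note that nothing here ties a window to a character in the ground truth; that alignment is exactly the problem the question raises.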
Hello Issam, To recognize the image you've provided, any type of classifier (a neural network, for example) could be trained to recognize the spaces between characters. For example, there is a space between the characters 'H' and 'e' in the picture above. Overlapping sliding windows could be passed over your image, feeding the small fraction of the image they capture to a neural network that recognizes where there is a space between characters. Once this problem is solved, it comes down to having another classifier (maybe even another neural network!) recognize the handwritten characters. For example, let's say you pass the sliding windows over the "Hello World" image and have your NN figure out the spaces. Theoretically, the image would be broken down as follows:
Where each red line represents a position where your NN detected a space. Since the characters are all broken up, it's a matter of simple character recognition and combining the results to form "Hello World". Notice, however, that the space between the 'o' and 'r' in 'World' might not be detected because the letters are so close together. In that case, the sliding windows could be refined, or a stronger classifier used. Hope this helps, as this was my first post :)
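The second stage described above, turning per-window space/non-space decisions into character segments, can be sketched like this. The classifier itself is hypothetical; only the grouping logic is shown.

```python
def split_at_spaces(window_positions, is_space):
    """Group consecutive non-space windows into character segments.

    is_space[i] is the (hypothetical) space classifier's decision
    for the window starting at window_positions[i].
    Returns (start, end) window positions for each segment.
    """
    segments, current = [], []
    for pos, space in zip(window_positions, is_space):
        if space:
            if current:
                segments.append((current[0], current[-1]))
                current = []
        else:
            current.append(pos)
    if current:
        segments.append((current[0], current[-1]))
    return segments

positions = [0, 5, 10, 15, 20, 25, 30]
flags     = [False, False, True, False, False, True, False]
print(split_at_spaces(positions, flags))  # [(0, 5), (15, 20), (30, 30)]
```

Each returned segment would then be cropped out and passed to the character classifier.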
The standard approach is to create character-level ground truth first. I.e., use some heuristic to break the line up into candidate characters, then make a guess about which segment corresponds to each character; that'll be your first training set. Use that set to train a classifier which predicts the character in a given patch. This classifier will give you a good guess of what character each patch contains, so you can use it to go over your original candidates and make better segment-to-letter assignments. This gives you your second training set. Repeat several times until the process converges. This is very similar to Expectation-Maximization training. Once you have a good character-level classifier, you can recognize a whole line using the same heuristic process. This is what's done in "Gradient-Based Learning Applied to Document Recognition" by LeCun et al.: http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf An alternative approach is to make "Hello World" an actual label. If you limit your output to size N, you can have a neural network with N output layers, and for this example you tell it to produce "H" in the first layer, "e" in the second layer, "l" in the third layer, etc. The remaining N-11 layers can be set to some character representing a space. We've had some success using this approach for house-number recognition with N=5, but it's untested and probably needs some tricks to work on text: http://arxiv.org/abs/1312.6082
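The fixed-size-output encoding described in the second approach can be sketched in a few lines. N=16 is an arbitrary illustrative choice; the blank symbol is whatever the network's "no character" class is.

```python
def encode_label(text, n=16, blank=' '):
    """Pad a line label to a fixed length n, one target per output layer.

    "Hello World" has 11 characters, so the remaining n - 11 slots
    are filled with the blank symbol.
    """
    if len(text) > n:
        raise ValueError("label longer than the fixed output size")
    return list(text) + [blank] * (n - len(text))

targets = encode_label("Hello World")
print(targets)
# ['H','e','l','l','o',' ','W','o','r','l','d',' ',' ',' ',' ',' ']
```

Each of the N output layers is then trained with an ordinary per-character classification loss against its slot in this list.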
You might want to check out Alex Graves' work on (multi-dimensional) recurrent networks, which do the segmentation (i.e., which part of the image corresponds to which character) implicitly. Finding the right chunking of pixels into characters is thus learned. This is not only convenient, but also worked best in several competitions, IIRC.
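Graves' networks are trained with the CTC loss, which sidesteps explicit segmentation. The decoding side gives the flavor: the network emits one label (or a blank) per window/frame, and a greedy decode collapses repeats and drops blanks. A minimal sketch of that collapse step:

```python
def ctc_collapse(frame_labels, blank='-'):
    """Greedy CTC decoding: merge consecutive repeated labels,
    then drop blanks, mapping a per-frame label sequence to a string
    without any explicit character segmentation."""
    out, prev = [], None
    for c in frame_labels:
        if c != prev and c != blank:
            out.append(c)
        prev = c
    return ''.join(out)

print(ctc_collapse(list("HH-ee-lll-l-oo")))  # "Hello"
```

Note how the blank between the two 'l' runs is what lets the decoder recover a genuine double letter; during training, CTC sums over all frame alignments that collapse to the target string.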

