I am trying to do OCR on screenshots that contain ordinary computer-rendered text (not scanned, no CAPTCHA tricks, a constant font size): text as you'd normally see on a computer screen displaying a web page or a text editor.
My initial attempt at this problem was to capture images of all the individual characters, then calculate the correlation between each captured character and the leftmost location in a text line containing data (anything other than the background color), select the most correlated character, and advance by the width of the matched character.
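For reference, the correlation matching described above can be sketched roughly like this (a minimal illustration, assuming grayscale numpy arrays and a `templates` dict mapping each character to its captured image; normalized correlation is used here so overall brightness differences matter less):

```python
import numpy as np

def best_match(window, templates):
    """Score each captured character template against an image window of the
    same height, anchored at the window's left edge; return the best label.

    window:    2D grayscale array starting at the leftmost data column.
    templates: dict mapping character label -> 2D grayscale template array.
    """
    def ncc(a, b):
        # Normalized cross-correlation of two equally shaped patches.
        a = a - a.mean()
        b = b - b.mean()
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float((a * b).sum() / denom) if denom else 0.0

    best_label, best_score = None, -2.0
    for label, tmpl in templates.items():
        h, w = tmpl.shape
        patch = window[:h, :w]
        if patch.shape != tmpl.shape:
            continue  # template extends past the remaining window
        score = ncc(patch, tmpl)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score
```

After a match, the caller would advance by the matched template's width, which is exactly the step that accumulates error with anti-aliased, subpixel-positioned text.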
There are a couple of obvious problems with this approach. First, the font used for rendering is not a bitmap font but a vector-based TrueType font, rendered with anti-aliasing. Because of this, the captured images of the individual characters are not quite the same as the character images in the rendered text. Second, the approach to advancing through the text is quite naive, breaking down the further we go within a text line (due to anti-aliasing, subpixel rendering, hinting, etc.). I've tried to remedy the second problem by subtracting the first matched character from the image and then finding the next leftmost position containing data. The problem with this, again, is the difference between the captured character images and the actual rendered characters, which leaves random artifacts around the edges of the characters that can confuse further matching.
This seems like an excellent place to use machine learning algorithms, but, not having real-life experience with them, I'm not quite sure what approach to take. Most of the web pages I've found on the subject deal with reading individual characters using neural networks. To do this, I would first need to separate the characters from each other, which itself seems hard to do (due, again, to anti-aliasing, and a font size of only about 10 pixels from top to bottom).
How should I approach this problem? What specific algorithms do you recommend?
asked Nov 15 '10 at 01:01
You can extract individual characters by adaptive thresholding (i.e., each pixel becomes black or white depending on whether it is darker than the median intensity of its neighborhood), and then taking the bounding boxes of connected blobs of black pixels as your candidates.
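A minimal sketch of that step, in pure numpy so the mechanics are visible (the window radius of 7 is an arbitrary choice here; a real implementation would use an optimized routine such as OpenCV's `adaptiveThreshold` plus `connectedComponents` instead of these loops):

```python
import numpy as np

def adaptive_threshold(gray, radius=7):
    """Mark a pixel as 'ink' if it is darker than the median of its neighborhood."""
    h, w = gray.shape
    out = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            y0, y1 = max(0, y - radius), min(h, y + radius + 1)
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            out[y, x] = gray[y, x] < np.median(gray[y0:y1, x0:x1])
    return out

def connected_component_boxes(mask):
    """Bounding boxes (min_x, min_y, max_x, max_y) of 8-connected blobs of True pixels."""
    h, w = mask.shape
    seen = np.zeros((h, w), dtype=bool)
    boxes = []
    for y in range(h):
        for x in range(w):
            if mask[y, x] and not seen[y, x]:
                stack, ys, xs = [(y, x)], [], []
                seen[y, x] = True
                while stack:  # iterative flood fill
                    cy, cx = stack.pop()
                    ys.append(cy)
                    xs.append(cx)
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = cy + dy, cx + dx
                            if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                                seen[ny, nx] = True
                                stack.append((ny, nx))
                boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return sorted(boxes)  # left-to-right reading order
```

Note that with a ~10-pixel font, touching characters may end up in one blob; you may need to split wide boxes heuristically.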
Once you have your character candidates, you need to extract features for each one. The feature set described in "Gradient-based contour encoding for character recognition" is easy to implement and works well. Basically, you compute the gradient direction at each pixel, break the character's bounding box into a 5x5 grid, and for each grid cell count the number of "South" gradient directions, "South-West" gradient directions, and so on.
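A rough sketch of that kind of feature extractor (this is a generic gradient-direction histogram over a 5x5 grid, not necessarily the paper's exact encoding; the 8-bin direction quantization and the "ignore flat pixels" rule are my assumptions):

```python
import numpy as np

def gradient_features(glyph, grid=5, bins=8):
    """Per-cell histograms of quantized gradient directions over a grid.

    glyph: 2D grayscale array for one character candidate.
    Returns a flat feature vector of length grid * grid * bins (200 by default).
    """
    glyph = glyph.astype(float)
    gy, gx = np.gradient(glyph)           # finite-difference gradients
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx)              # direction in [-pi, pi]
    dirs = ((ang + np.pi) / (2 * np.pi) * bins).astype(int) % bins

    h, w = glyph.shape
    feats = np.zeros((grid, grid, bins))
    for y in range(h):
        for x in range(w):
            if mag[y, x] > 0:             # only pixels with an actual edge
                cy = min(y * grid // h, grid - 1)
                cx = min(x * grid // w, grid - 1)
                feats[cy, cx, dirs[y, x]] += 1
    return feats.ravel()
```

One nice property of counting gradient directions is that it is fairly tolerant of the anti-aliasing mismatch that breaks exact template correlation.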
Then you can use these feature vectors to train your classifier. I used this approach to process video frames, so it would probably work for screen text too.
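The answer doesn't name a specific classifier, so as an illustration here is about the simplest thing that could work on those feature vectors, a 1-nearest-neighbor classifier (in practice you might prefer an SVM or a small neural network):

```python
import numpy as np

class NearestNeighborOCR:
    """Classify a character by the closest training feature vector (1-NN)."""

    def __init__(self):
        self.feats = []
        self.labels = []

    def train(self, feature_vec, label):
        # Store one labeled example, e.g. the 200-dim gradient feature vector.
        self.feats.append(np.asarray(feature_vec, dtype=float))
        self.labels.append(label)

    def classify(self, feature_vec):
        # Return the label of the nearest stored example (Euclidean distance).
        q = np.asarray(feature_vec, dtype=float)
        dists = [np.linalg.norm(f - q) for f in self.feats]
        return self.labels[int(np.argmin(dists))]
```

Since you control the rendering, you can generate training examples cheaply by rendering each character of the target font yourself and running the same segmentation and feature extraction on the result.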
[Image: local thresholding followed by erosion]
[Image: local thresholding followed by dilation]