I am trying to do OCR on screenshots that contain normal computer-rendered text (not scanned, no CAPTCHA tricks, a constant font size): text as you'd normally see it on a computer screen displaying a web page or a text editor.

My initial attempt at this problem was to capture images of all the individual characters, then calculate the correlation between each captured character and the leftmost location in a text line that contains data (anything other than the background color), select the most correlated character, and advance by the width of the matched character.
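A minimal sketch of that greedy matching loop, in Python with plain nested lists. The names `correlation`, `crop`, and `read_line` are illustrative and not from any library, and the sketch assumes the templates are cropped tightly on the left, share the line's height, and that the background value is exactly known.

```python
from math import sqrt

def correlation(a, b):
    """Pearson correlation between two equally sized flat pixel lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a flat patch carries no shape information
    return cov / sqrt(var_a * var_b)

def crop(image, x, width):
    """Flatten columns x .. x+width-1 of every row into one list."""
    return [v for row in image for v in row[x:x + width]]

def read_line(line, templates, background=0):
    """Greedy left-to-right matching of glyph templates against one text line."""
    total_width = len(line[0])
    text, x = [], 0
    while x < total_width:
        # Skip columns that contain only background.
        if all(row[x] == background for row in line):
            x += 1
            continue
        best_char, best_score, best_width = None, -2.0, 1
        for char, glyph in templates.items():
            w = len(glyph[0])
            if x + w > total_width:
                continue  # template would run past the end of the line
            score = correlation(crop(line, x, w), crop(glyph, 0, w))
            if score > best_score:
                best_char, best_score, best_width = char, score, w
        if best_char is None:
            break  # no template fits in the remaining columns
        text.append(best_char)
        x += best_width  # advance by the width of the matched glyph
    return "".join(text)
```

On clean binary glyphs this recovers the text exactly. With anti-aliased rendering the best score drops below 1 and the fixed advance drifts, which is the failure described in the next paragraph.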

There are a couple of obvious problems with this approach. First, the font used for rendering is not a bitmap font but a vector-based TrueType font, rendered with anti-aliasing. Because of this, the captured images of the individual characters are not quite the same as the character images in the rendered text. Second, the way I advance through the text is quite naive and breaks down the further we go within a text line (due to anti-aliasing, the subpixel resolution used in rendering, hinting, etc.). I've tried to remedy the second problem by subtracting the first matched character from the image and then finding the next leftmost position of data in the image. The problem with this, again, is the difference between the captured character images and the images in the actual data, which leaves random artifacts around the edges of the characters that can confuse further matching.

This would seem like an excellent place to use machine learning, but having no real-life experience with it, I'm not sure what approach I should take. Most of the web pages I've found on the subject seem to deal with reading individual characters using neural networks. To do this, I would first need to separate the characters from each other, which by itself seems hard to do (due, again, to anti-aliasing and a font size of about 10 pixels from top to bottom).

How should I approach this problem? What specific algorithms do you recommend I use for this?

asked Nov 15 '10 at 01:01

Sami


One Answer:

You can extract individual characters by adaptive thresholding (i.e., each pixel is black or white depending on whether it is darker than the median intensity in its neighborhood), and then taking the bounding boxes of connected blobs of black pixels as your candidates.
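As a rough illustration of those two steps, here is a sketch in plain Python (not the MATLAB used further down). The function names, the `radius` parameter, and the choice of 8-connectivity are assumptions made for the example.

```python
from statistics import median

def adaptive_threshold(gray, radius=2):
    """Mark a pixel as ink (1) when it is darker than its neighborhood median."""
    h, w = len(gray), len(gray[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [gray[j][i]
                      for j in range(max(0, y - radius), min(h, y + radius + 1))
                      for i in range(max(0, x - radius), min(w, x + radius + 1))]
            out[y][x] = 1 if gray[y][x] < median(window) else 0
    return out

def bounding_boxes(binary):
    """Return (left, top, right, bottom) for each 8-connected blob of ink."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y0 in range(h):
        for x0 in range(w):
            if not binary[y0][x0] or seen[y0][x0]:
                continue
            # Flood fill from this seed pixel, growing the box as we go.
            stack = [(x0, y0)]
            seen[y0][x0] = True
            left, top, right, bottom = x0, y0, x0, y0
            while stack:
                x, y = stack.pop()
                left, right = min(left, x), max(right, x)
                top, bottom = min(top, y), max(bottom, y)
                for ny in range(max(0, y - 1), min(h, y + 2)):
                    for nx in range(max(0, x - 1), min(w, x + 2)):
                        if binary[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((nx, ny))
            boxes.append((left, top, right, bottom))
    return sorted(boxes)
```

Each returned box is one character candidate. Glyphs that touch will come back as a single box, and glyphs made of several parts (such as "i") as several boxes.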

Once you have your character candidates, you need to extract features for each candidate. The feature set described in "Gradient-based contour encoding for character recognition" is easy to implement and works well. Basically, you compute the gradient direction for each pixel, then break the character's bounding box into a 5x5 grid; for each grid square, the features give the number of "South" gradient directions, "South-West" gradient directions, and so on.
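A sketch of this kind of feature in plain Python, following the description above rather than the exact procedure of the cited paper. The central-difference gradient, the 8 compass bins, and the name `gradient_features` are assumptions for illustration.

```python
from math import atan2, pi

def gradient_features(glyph, grid=5, bins=8):
    """Count quantized gradient directions in each cell of a grid x grid layout."""
    h, w = len(glyph), len(glyph[0])
    features = [0] * (grid * grid * bins)

    def pixel(x, y):
        # Clamp coordinates so border pixels reuse their nearest neighbor.
        return glyph[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    for y in range(h):
        for x in range(w):
            gx = pixel(x + 1, y) - pixel(x - 1, y)
            gy = pixel(x, y + 1) - pixel(x, y - 1)
            if gx == 0 and gy == 0:
                continue  # flat region, no direction to record
            # Image rows grow downward, so bin 0 is East, 2 is South,
            # 4 is West and 6 is North; odd bins are the diagonals.
            angle = atan2(gy, gx) % (2 * pi)
            direction = int(round(angle / (2 * pi / bins))) % bins
            cell = (y * grid // h) * grid + (x * grid // w)
            features[cell * bins + direction] += 1
    return features
```

With the defaults this yields a 200-element vector (25 cells times 8 directions) for every candidate, whatever the size of its bounding box.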

Then you can use these feature sets as the basis for training your classifier. I used this approach to process video frames, so it would probably work for screen text too.
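The classifier itself is not specified here. As one simple possibility, a nearest-neighbor classifier over the feature vectors could look like the sketch below; the choice of 1-NN and the names `train` and `classify` are assumptions, not the method actually used on the video frames.

```python
def train(examples):
    """Store labeled feature vectors; 1-NN needs no other training step."""
    return [(list(features), label) for features, label in examples]

def classify(model, features):
    """Return the label of the stored vector closest to `features`."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(model, key=lambda item: squared_distance(item[0], features))[1]
```

Here `examples` is a list of `(feature_vector, character)` pairs built from labeled character candidates. Any other classifier that accepts fixed-length vectors could be substituted.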


Local thresholding followed by erosion

Local thresholding followed by dilation

function bw=localthreshold(IM,ws,st)
% LOCALTHRESHOLD does local thresholding for black, selects regions that
% are less than mean-std*st, where std is local standard deviation
if (nargin<3)
    error('You must provide the image IM, the window size ws, and st.');
end



SE = strel('disk',1);
RGB_label= label2rgb(labeled, @spring, 'c', 'shuffle');
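The body of the MATLAB function is not shown in full above. The rule stated in its comment (a pixel is selected when it is less than the local mean minus `st` times the local standard deviation) can be sketched in plain Python as follows. The name `local_threshold`, the square window of side `ws`, and the use of the population standard deviation are assumptions, and this is not a port of the original function.

```python
from statistics import mean, pstdev

def local_threshold(image, ws, st):
    """Select pixels darker than (local mean - st * local standard deviation)."""
    h, w = len(image), len(image[0])
    r = ws // 2  # half-width of the square window
    bw = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [image[j][i]
                      for j in range(max(0, y - r), min(h, y + r + 1))
                      for i in range(max(0, x - r), min(w, x + r + 1))]
            limit = mean(window) - st * pstdev(window)
            bw[y][x] = 1 if image[y][x] < limit else 0
    return bw
```

Larger values of `st` select only the darkest pixels in each neighborhood, which tends to thin the strokes and pull apart glyphs that touch.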

answered Nov 15 '10 at 03:19

Yaroslav Bulatov

edited Nov 17 '10 at 23:44

Unfortunately, the font rendering engine in this case seems to place quite a few dark pixels from different glyphs next to each other. Also, due to hinting, the left and right edges of adjacent glyphs can occupy the same pixels additively, so many of the glyphs appear to be connected when you look at them highly magnified. The type of adaptive thresholding you're suggesting does a poor job of extracting the characters.

(Nov 17 '10 at 01:57) Sami

Can you give a sample screenshot?

(Nov 17 '10 at 05:02) Yaroslav Bulatov

Actually, looks like I ended up finding the answer to my problem in the form of a doctoral thesis addressing the exact problem I'm facing:

RECOGNITION OF ULTRA LOW RESOLUTION, ANTI-ALIASED TEXT WITH SMALL FONT SIZES http://ethesis.unifr.ch/theses/downloads.php?file=EinseleF.pdf

(Nov 17 '10 at 11:30) Sami

Adding another link here that I found based on the above thesis, addressing the same problem:

Recognition of Screen-Rendered Text http://cvpr.uni-muenster.de/research/ScreenTextRecognition

(Nov 17 '10 at 13:27) Sami

I've added an example above. Local thresholding/dilation can separate characters even if they are touching, but each character may have more than one region associated with it.

(Nov 17 '10 at 17:57) Yaroslav Bulatov
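One possible way to handle the case in the last comment, where a single character yields more than one region, is to merge regions whose horizontal extents overlap. This is only a sketch under the assumption that the boxes come from one text line and that neighboring glyphs do not overlap horizontally; the name `merge_overlapping_columns` is made up for the example.

```python
def merge_overlapping_columns(boxes):
    """Merge (left, top, right, bottom) boxes whose x-ranges overlap."""
    merged = []
    for left, top, right, bottom in sorted(boxes):
        if merged and left <= merged[-1][2]:
            # Overlaps the previous box horizontally: grow that box instead.
            pl, pt, pr, pb = merged[-1]
            merged[-1] = (pl, min(pt, top), max(pr, right), max(pb, bottom))
        else:
            merged.append((left, top, right, bottom))
    return merged
```

For example, the dot and the stem of an "i" share the same columns and end up in one box, while a neighboring glyph a few pixels to the right stays separate.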

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.