I am trying to recover the original number from noisy text produced by running OCR on an image.
Example:

ABC: $1,234.S67.00 XYZ <-  1234597.00

These are all balances, and I am thinking of framing this as a semi-CRF problem, following Sarawagi and Cohen's Semi-Markov Conditional Random Field paper, in which I would have labeled segments:

span_1:'ABC: $'   label: START
span_2:'1,'       label: 1
span_3:'2'        label: 200
span_4:'3,'       label: 30
span_5:'4,'       label: 4
span_6:'S'        label: 500
span_7:'6,'       label: 60
span_8:'7'        label: 7
span_9:'.00'      label: DECIMAL
span_10:' XYZ'    label: END

I am looking for responses of this nature: Is this a good or bad approach, and how could it be improved? Are there alternative approaches? Any anecdotes from experience doing this in the past? Links to other good papers related to this topic?

asked Jan 10 '12 at 13:39

Brent Payne

One Answer:

One thing I'd advise against is using place-value labels such as 200 rather than a plain digit label such as 2, as this will grow your parameter space needlessly. It is much better to have specific labels for the comma and the period, since the model can then learn things such as comma-comma being unlikely, and a higher-order model will be able to learn things such as "XXX," (three digits followed by a comma) being a common 4-gram.
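
To make the smaller label set concrete, here is a minimal sketch, assuming one label per character: ten digit labels, one for the comma, one for the period, and one for everything else. The label names (D0..D9, COMMA, PERIOD, O) and the decode function are hypothetical, made up for illustration, and the label sequence is written by hand rather than predicted by a model. It follows the question in reading the 'S' as a 5. The place value never needs to be in the label, because it falls out of the position of each digit once the sequence is decoded.

```python
from decimal import Decimal

# Hypothetical label set: one label per digit value, plus separators.
DIGIT_LABELS = {f"D{d}": str(d) for d in range(10)}
COMMA, PERIOD, OTHER = "COMMA", "PERIOD", "O"


def decode(labels):
    """Turn a per-character label sequence into a Decimal (None if no digits)."""
    pieces = []
    for label in labels:
        if label in DIGIT_LABELS:
            pieces.append(DIGIT_LABELS[label])
        elif label == PERIOD:
            pieces.append(".")
        # COMMA and OTHER carry no numeric value, so they are skipped.
    text = "".join(pieces)
    return Decimal(text) if any(ch.isdigit() for ch in text) else None


noisy = "1,234.S67.00"
# Hand-written labels: the first '.' is labeled as a comma, the 'S' as a 5.
labels = ["D1", COMMA, "D2", "D3", "D4", COMMA,
          "D5", "D6", "D7", PERIOD, "D0", "D0"]
assert len(noisy) == len(labels)

value = decode(labels)
print(value)
```

With this scheme there are 13 labels in total no matter how long the numbers get, whereas place-value labels need a new set of labels for every extra digit position.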

For this kind of noisy text recovery you might have luck with methods that explicitly learn edit-distance transducers, such as McCallum et al., "A Conditional Random Field for Discriminatively-trained Finite-state String Edit Distance".
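
To give a rough feel for what such a transducer scores, here is a minimal sketch of a weighted edit distance in which known OCR confusions are cheap substitutions. This is not the CRF model from the paper, which learns its costs from data. The costs in CONFUSION_COST are hand-set, hypothetical values, and the candidate strings are made up for illustration.

```python
# Hypothetical, hand-set costs for (noisy char, clean char) OCR confusions.
# A learned model would estimate these from aligned (noisy, clean) pairs.
CONFUSION_COST = {("S", "5"): 0.1, (".", ","): 0.1, ("O", "0"): 0.1, ("l", "1"): 0.1}


def sub_cost(noisy_ch, clean_ch):
    """Cost of reading clean_ch as noisy_ch: 0 if equal, cheap if a known confusion."""
    if noisy_ch == clean_ch:
        return 0.0
    return CONFUSION_COST.get((noisy_ch, clean_ch), 1.0)


def weighted_edit_distance(noisy, clean, indel=1.0):
    """Standard dynamic-programming edit distance with per-pair substitution costs."""
    rows, cols = len(noisy) + 1, len(clean) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i * indel  # delete every noisy character
    for j in range(1, cols):
        dist[0][j] = j * indel  # insert every clean character
    for i in range(1, rows):
        for j in range(1, cols):
            dist[i][j] = min(
                dist[i - 1][j] + indel,  # deletion
                dist[i][j - 1] + indel,  # insertion
                dist[i - 1][j - 1] + sub_cost(noisy[i - 1], clean[j - 1]),
            )
    return dist[-1][-1]


noisy = "1,234.S67.00"
candidates = ["1,234,567.00", "1,234,867.00"]  # made-up clean hypotheses
best = min(candidates, key=lambda c: weighted_edit_distance(noisy, c))
print(best)
```

The first candidate wins here because both of its mismatches ('.' for ',' and 'S' for '5') are listed as cheap confusions, while reading 'S' as '8' pays the full substitution cost.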

answered Jan 10 '12 at 17:20

Alexandre Passos ♦

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.