|
I am trying to recover the number behind some noisy text that was produced by running OCR on an image.
These are all balances, and I am thinking of framing this as a semi-CRF problem, following Sarawagi and Cohen's Semi-Markov Conditional Random Field paper, where I would have labeled segments:
I am looking for responses of this nature: Any ideas as to whether this is a good or bad approach, or how to improve it? Alternative approaches? Any anecdotes from having done this in the past? Links to other good papers related to this topic? |
|
One thing I'd advise against is using whole-number labels such as 200 rather than per-digit labels such as 2, as this will grow your parameter space needlessly. It is much better to have a specific label for the comma and for the period, as the model will then learn things such as comma-comma being unlikely, and if you use a higher-order model it will be able to learn things such as "XXX," being a popular 4-gram. For recovering this kind of noisy text you might have luck with methods that explicitly learn edit-distance transducers, such as McCallum et al., "A Conditional Random Field for Discriminatively-trained Finite-state String Edit Distance". |
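To make the labeling suggestion concrete, here is a minimal sketch in Python, assuming clean training balances are available as plain strings. The function names (`char_labels`, `shape`, `ngram_counts`) and the sample balances are made up for illustration and do not come from any CRF library. It only shows the per-character label set and the kind of n-gram statistics a higher-order model could pick up; it is not a CRF implementation.

```python
from collections import Counter

DIGITS = "0123456789"

def char_labels(balance: str) -> list[str]:
    """One label per character: each digit is its own label, plus COMMA and PERIOD.

    This keeps the label set at 12 symbols instead of one label per whole number.
    """
    labels = []
    for ch in balance:
        if ch in DIGITS:
            labels.append(ch)
        elif ch == ",":
            labels.append("COMMA")
        elif ch == ".":
            labels.append("PERIOD")
        else:
            raise ValueError(f"unexpected character in balance: {ch!r}")
    return labels

def shape(balance: str) -> str:
    """Abstract every digit to 'X' so that patterns such as 'XXX,' become visible."""
    return "".join("X" if ch in DIGITS else ch for ch in balance)

def ngram_counts(balances: list[str], n: int) -> Counter:
    """Count shape n-grams over a list of (hypothetical) clean balances."""
    counts = Counter()
    for balance in balances:
        s = shape(balance)
        for i in range(len(s) - n + 1):
            counts[s[i:i + n]] += 1
    return counts

# Hypothetical sample data, for illustration only.
balances = ["1,200.50", "12,345,678.00", "999.99"]
four_grams = ngram_counts(balances, 4)
bigrams = ngram_counts(balances, 2)

print(char_labels("1,200.50"))
print(four_grams.most_common(3))
print(bigrams[",,"])  # comma-comma never occurs in well-formed balances
```

With real data the counts would come from your own labeled balances; the point is only that separate comma and period labels let the model see that ",," is absent while patterns such as "XXX," and "XXX." recur.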