Suppose I have an OCR engine that makes 3 kinds of mistakes:

  • substitutes 1 character for another 1
  • merges 2 characters into 1, and
  • splits 1 characters into 2.

I'm given pairs of string (correct and observed word) and need to learn a model of P(observed|correct).

Each pair can have several different error sequences associated with it, so it makes sense to model error sequence as a hidden variable and use some kind of Expectation Maximization

The problem is that optimal E step doesn't have closed form, so some compromises need to be made. My question -- has anyone seen a very similar model in the literature, and how did they train it?

asked May 11 '11 at 00:10

Yaroslav%20Bulatov's gravatar image

Yaroslav Bulatov
1963193458

edited May 11 '11 at 00:10


One Answer:

Indeed, I think this is very similar to the Brown alignment models used in machine translation. I could be wrong, but they also have intractable E steps (this is somewhat dealt with by clever initialization and approximations). See the original Brown et al paper, the mathematics of machine translation, for more details.

answered May 11 '11 at 02:44

Alexandre%20Passos's gravatar image

Alexandre Passos ♦
1896744214334

Another thing that might help you is using hard EM (or viterbi EM), in which you maximize in the E step instead of computing an expectation, as maximization in this model is just the usual dynamic programming text alignment/diff algorithm.

(May 11 '11 at 14:42) Alexandre Passos ♦

Well, if you do E-step on a per-example basis, then it's tractable. In other words, you compute separate distribution over paths for each example. I was getting non-closed form when I tried to get the E step by following steps in Minka's "Expectation Maximization as lower bound maximization" -- distribution over hidden variables seems to jointly depend on all training examples

(May 11 '11 at 23:53) Yaroslav Bulatov
Your answer
toggle preview

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.