I recently implemented a CCG parser following Hockenmaier (2003), but its performance on real data is poor because there are many out-of-vocabulary (OOV) words. The training data may simply be too small, but manually annotating a larger corpus is too expensive.

My approach is to use a word's POS tag to predict its category: p(word|cat) = sum_pos p(word|pos) p(pos|cat). However, the accuracy is only 52%.
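To make the formula concrete, here is a minimal sketch of the marginalization over POS tags. It assumes that p(word|pos) comes from a POS-tagged corpus and p(pos|cat) from the CCG training data, both as plain maximum-likelihood estimates. The function names and the toy corpora are hypothetical, not taken from my parser. Note that the formula treats the word as independent of the category once the POS tag is known.

```python
from collections import Counter

def conditional(pairs):
    """Maximum-likelihood estimate of p(a | b) from (a, b) pairs, keyed by (a, b)."""
    pairs = list(pairs)
    joint = Counter(pairs)
    marginal = Counter(b for _, b in pairs)
    return {(a, b): n / marginal[b] for (a, b), n in joint.items()}

def p_word_given_cat(word, cat, p_word_pos, p_pos_cat):
    """p(word | cat) = sum over pos of p(word | pos) * p(pos | cat)."""
    pos_tags = {pos for (_, pos) in p_word_pos}
    return sum(
        p_word_pos.get((word, pos), 0.0) * p_pos_cat.get((pos, cat), 0.0)
        for pos in pos_tags
    )

# Hypothetical toy data: (word, pos) pairs and (pos, cat) pairs.
pos_corpus = [("dog", "NN"), ("dog", "NN"), ("cat", "NN"), ("runs", "VBZ")]
ccg_corpus = [("NN", "N"), ("NN", "N"), ("NN", "N/N"), ("VBZ", "S\\NP")]

p_word_pos = conditional(pos_corpus)   # p(word | pos)
p_pos_cat = conditional(ccg_corpus)    # p(pos | cat)

dog_given_n = p_word_given_cat("dog", "N", p_word_pos, p_pos_cat)
runs_given_n = p_word_given_cat("runs", "N", p_word_pos, p_pos_cat)
```

In this toy setup every N in the CCG data is tagged NN, so p(dog|N) reduces to p(dog|NN), while p(runs|N) is zero because VBZ never occurs with N.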

Another approach is to use more information when calculating p(word|cat), such as the word's POS tag, its parent's POS tag, and/or its sibling's tag. With this the performance improves (80%), but it is still not good enough for the output to be used as training data.
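As an illustration only, one simple way to combine several such cues is a naive Bayes classifier that predicts the category from the features directly. This is a swapped-in technique and not the exact model I used; the feature names (pos, parent_pos, sibling_tag), the add-one smoothing, and the toy examples are all assumptions.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (features, cat), where features is a dict of name -> value."""
    cat_counts = Counter()
    feat_counts = defaultdict(Counter)   # (feature name, cat) -> counts of values
    values = defaultdict(set)            # feature name -> values seen in training
    for feats, cat in examples:
        cat_counts[cat] += 1
        for name, value in feats.items():
            feat_counts[(name, cat)][value] += 1
            values[name].add(value)
    return cat_counts, feat_counts, values

def predict(feats, model):
    """Return the category maximizing log p(cat) + sum of log p(feature | cat)."""
    cat_counts, feat_counts, values = model
    total = sum(cat_counts.values())
    best_cat, best_score = None, -math.inf
    for cat, n_cat in cat_counts.items():
        score = math.log(n_cat / total)
        for name, value in feats.items():
            count = feat_counts[(name, cat)][value]
            # Add-one smoothing, with one extra slot reserved for unseen values.
            score += math.log((count + 1) / (n_cat + len(values[name]) + 1))
        if score > best_score:
            best_cat, best_score = cat, score
    return best_cat

# Hypothetical toy training examples.
examples = [
    ({"pos": "NN", "parent_pos": "NP", "sibling_tag": "DT"}, "N"),
    ({"pos": "NN", "parent_pos": "NP", "sibling_tag": "JJ"}, "N"),
    ({"pos": "JJ", "parent_pos": "NP", "sibling_tag": "NN"}, "N/N"),
    ({"pos": "VBZ", "parent_pos": "VP", "sibling_tag": "NP"}, "(S\\NP)/NP"),
]
model = train(examples)
noun_guess = predict({"pos": "NN", "parent_pos": "NP", "sibling_tag": "DT"}, model)
adj_guess = predict({"pos": "JJ", "parent_pos": "NP", "sibling_tag": "NN"}, model)
```

Because each feature is scored independently given the category, the number of parameters grows with the sum of the feature vocabularies rather than their product, at the cost of ignoring interactions between features.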

The model's complexity seems to grow quickly as more information is added. Is there a method for estimating an OOV word's category with high accuracy?

asked Jul 15 '14 at 08:58

Huijia Wu
