I want to play around with some feature ideas for POS tagging, so I want to whip up a quick implementation. The baselines I want to compare against mostly use Maximum Entropy Markov Models (MEMMs). What toolkits would you recommend for this? The main one I can find is MALLET, http://mallet.cs.umass.edu/, but I'm a bit disappointed with its documentation so far. Is there one in C++ somewhere? There are lots of MaxEnt packages, but few MEMM toolkits that I can find.

asked Jul 20 '10 at 20:35

syllogism

edited Jul 20 '10 at 21:22


5 Answers:

The Stanford POS tagger is an MEMM if all used features have a rightContext() of zero. (As in the included left3words model.)

answered Jul 24 '10 at 18:05

Christopher Manning

@syllogism: for your purposes (= flexible features and faster training than CRFs?), you can also have rightContext() > 0. This will not be an MEMM because the inference is slightly different, but it will train just as fast and be somewhat more expressive. And as you seem to be more interested in the NLP side than the ML side anyway, who cares whether it is exactly an MEMM or not?

(Jul 25 '10 at 13:19) yoavg

@yoav Yeah, I think my question was ill-conceived. It really isn't important that it's an MEMM. I was interested in comparing with MEMM taggers, but the relevant dimension is the new features, so I can just compare with and without the features I'm testing.

(Jul 25 '10 at 18:28) syllogism

Why not use CRFs instead of MEMMs? I think CRFs are the state of the art, and there are a lot of libraries implementing them. You can always duplicate your baseline features and train CRFs on them.

answered Jul 20 '10 at 20:43

Alexandre Passos ♦

1

As far as I'm aware, the difference between MEMMs and CRFs is not large when full feature sets are used, and MEMMs are faster to train. I also want to repeat these experiments for CCG supertagging, where you have over 400 labels. People usually use MEMMs for this task, as the large label set makes CRFs less practical.

(Jul 20 '10 at 21:20) syllogism

There is this new CRF toolkit that's supposed to be really fast: http://wapiti.limsi.fr/. It's described in a recent ACL paper.

(Jul 22 '10 at 23:35) Frank

Turns out most of the CRF toolkits use something like the CRF++ template system, which prevents me from designing arbitrary feature functions. The only two I've found that don't are MALLET and Pocket CRF. Training a POS tagger on sections 02-21 of the WSJ seems intractable with MALLET (over 24 hours of training and only about 30 iterations), and Pocket CRF keeps segfaulting.

Sod it. Going to just train the model with MegaM, and write the classifier myself in Python.

answered Jul 22 '10 at 20:16

syllogism

1

I did add arbitrary word features into crfsuite without any major problem. Edge features are more complex, but you don't have a lot of flexibility with those and MEMMs anyway.

(Jul 22 '10 at 23:43) Alexandre Passos ♦

If you are using Python, I found this toolkit to be effective for the maxent part. The Markov-model part of an MEMM is trivial, probably not more than 10 lines of Python code.
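As a rough illustration of that Markov-model part, here is a minimal sketch of first-order Viterbi decoding on top of a per-token classifier. It runs a little longer than 10 lines because of the comments and the toy example. The `log_prob` callback interface and the hand-set probabilities in `toy_log_prob` are hypothetical stand-ins for a trained maxent model, not the API of any toolkit mentioned in this thread.

```python
import math


def viterbi_memm(tokens, labels, log_prob, start="<s>"):
    """Most probable label sequence under a first-order MEMM.

    log_prob(prev_label, tokens, i) is assumed to return a dict mapping each
    label to log P(label | prev_label, tokens, i).
    """
    if not tokens:
        return []
    # best[i][y] = (score of the best path ending in y at position i, backpointer)
    first = log_prob(start, tokens, 0)
    best = [{y: (first[y], None) for y in labels}]
    for i in range(1, len(tokens)):
        # One locally normalized distribution per possible previous label.
        dists = {p: log_prob(p, tokens, i) for p in labels}
        best.append({
            y: max((best[i - 1][p][0] + dists[p][y], p) for p in labels)
            for y in labels
        })
    # Follow backpointers from the best final label.
    y = max(labels, key=lambda label: best[-1][label][0])
    path = [y]
    for i in range(len(tokens) - 1, 0, -1):
        y = best[i][y][1]
        path.append(y)
    return path[::-1]


def toy_log_prob(prev, tokens, i):
    """Hypothetical classifier with hand-set probabilities, for illustration only."""
    if tokens[i] == "the":
        p = {"DT": 0.9, "NN": 0.05, "VB": 0.05}
    elif prev == "DT":
        p = {"DT": 0.05, "NN": 0.9, "VB": 0.05}
    elif prev == "NN":
        p = {"DT": 0.1, "NN": 0.2, "VB": 0.7}
    else:
        p = {"DT": 0.2, "NN": 0.4, "VB": 0.4}
    return {y: math.log(q) for y, q in p.items()}


print(viterbi_memm(["the", "dog", "barks"], ["DT", "NN", "VB"], toy_log_prob))
```

With a real model, `log_prob` would extract features from the tokens and the previous label and query the trained classifier. Longer label histories would need the state to carry more than one previous label.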

answered Jul 23 '10 at 04:34

yoavg

-1

Lingpipe has a Conditional Random Fields implementation. See this tutorial.

answered Jul 23 '10 at 19:35

Pedro Oliveira

edited Jul 24 '10 at 13:20

1

CRFs and MEMMs are different. Most of the other answers talk about this. To illustrate, MEMMs are normalized at each token, while CRFs are normalized globally.
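Written out for a first-order linear chain, the two normalizations look roughly like this. The weight vector w and feature function f are notation assumed here for illustration, not taken from any particular toolkit.

```latex
% MEMM: one normalizer per token, conditioned on the previous label
P_{\text{MEMM}}(y \mid x) = \prod_{t=1}^{T}
  \frac{\exp\big(w \cdot f(y_{t-1}, y_t, x, t)\big)}
       {\sum_{y'} \exp\big(w \cdot f(y_{t-1}, y', x, t)\big)}

% CRF: a single normalizer over all label sequences
P_{\text{CRF}}(y \mid x) =
  \frac{\exp\big(\sum_{t=1}^{T} w \cdot f(y_{t-1}, y_t, x, t)\big)}
       {\sum_{y'_{1:T}} \exp\big(\sum_{t=1}^{T} w \cdot f(y'_{t-1}, y'_t, x, t)\big)}
```

In the MEMM the sum in each denominator ranges over single labels, while in the CRF it ranges over entire label sequences.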

(Jul 23 '10 at 19:48) Alexandre Passos ♦
3

@pedro: another way to illustrate the difference between MEMMs and CRFs is to describe how they actually work procedurally (rather than mathematically):

In an MEMM, you train a log-linear classifier to predict the probability of each token's label, given features that include the k previous labels. Then, at test time, you do Viterbi inference to maximize the probability of the entire label sequence (so the entire sequence is considered only at test time).

In a CRF, you do forward-backward in training, and train your model to assign probabilities to entire sequences, based on all the possible sequences. Thus, training CRFs is much harder than training MEMMs. (Personally, I do not think it is worth it in most cases.)
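To make the cost of the CRF side concrete, here is a minimal sketch of the forward pass (one half of forward-backward), which computes the log of the normalizer over all label sequences for one sentence. The score tables `emit` and `trans` are hypothetical inputs standing in for feature scores. The brute-force version is only there to check the recursion on tiny inputs.

```python
import itertools
import math


def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def crf_log_partition(emit, trans):
    """Log normalizer of a linear-chain CRF via the forward recursion.

    emit[i][y]  : score of label y at position i
    trans[p][y] : score of moving from label p to label y
    Runs in O(T * L^2) time for T tokens and L labels.
    """
    n_labels = len(trans)
    alpha = list(emit[0])
    for i in range(1, len(emit)):
        alpha = [
            emit[i][y] + logsumexp([alpha[p] + trans[p][y] for p in range(n_labels)])
            for y in range(n_labels)
        ]
    return logsumexp(alpha)


def brute_force_log_partition(emit, trans):
    """Same quantity by enumerating all L^T sequences (tiny inputs only)."""
    n_labels = len(trans)
    scores = []
    for seq in itertools.product(range(n_labels), repeat=len(emit)):
        score = sum(emit[i][y] for i, y in enumerate(seq))
        score += sum(trans[a][b] for a, b in zip(seq, seq[1:]))
        scores.append(score)
    return logsumexp(scores)


emit = [[0.5, -1.0], [0.2, 0.3], [-0.4, 1.1]]
trans = [[0.1, -0.2], [0.7, 0.0]]
print(crf_log_partition(emit, trans), brute_force_log_partition(emit, trans))
```

CRF training repeats a pass like this (plus the backward pass) for every sentence on every iteration, and the L^2 factor grows quickly with label sets in the hundreds. An MEMM only needs a sum over L labels per token during training.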

(Jul 23 '10 at 20:11) yoavg

My fault. That's what you get when you answer in a hurry :) Thanks for the explanation, Alexandre & yoavg

(Jul 23 '10 at 21:26) Pedro Oliveira

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.