Recently, I realized my pipeline needs to support text in multiple languages. My application relies on POS taggers and parsers that assume the text is all in English. Is there a good resource for taggers and parsers that work on other languages?

asked Oct 07 '10 at 14:37

The best way I know of to get around this problem is to train standard parsers and taggers on multilingual data. A good source of this data is the CoNLL-X shared task on multilingual dependency parsing, whose treebanks also include POS tagging data.
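To make the "POS tagging data in the treebanks" point concrete, here is a minimal sketch of pulling (word, tag) pairs out of a CoNLL-X style file. It assumes the usual CoNLL-X layout: one token per line, tab-separated columns with FORM in column 2 and POSTAG in column 5, and a blank line between sentences. The function name `read_conllx_tags` is made up for illustration.

```python
from typing import Iterable, Iterator, List, Tuple


def read_conllx_tags(lines: Iterable[str]) -> Iterator[List[Tuple[str, str]]]:
    """Yield one sentence at a time as a list of (form, postag) pairs.

    Assumes CoNLL-X style input: tab-separated columns, FORM in column 2,
    POSTAG in column 5, sentences separated by blank lines.
    """
    sentence: List[Tuple[str, str]] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        if len(cols) < 5:
            raise ValueError(f"expected at least 5 tab-separated columns: {line!r}")
        sentence.append((cols[1], cols[4]))
    # The last sentence may not be followed by a blank line.
    if sentence:
        yield sentence
```

Since a file object is already an iterable of lines, you can pass an open treebank file straight to the function. If the fine-grained tags are too sparse for your language, the coarse-grained tag in column 4 (`cols[3]`) is the other option.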

Apart from this, for languages where you have a lot of unlabeled data, you might get good enough results by running unsupervised POS tagging and using the resulting word clusters as features in your pipeline.
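A rough sketch of what "clusters as features" could look like in practice. It assumes you already have a word-to-cluster mapping from some unsupervised tagger or clustering tool; the `clusters` dictionary, the feature names, and the `token_features` function are all hypothetical, not taken from any particular toolkit.

```python
from typing import Dict, List


def token_features(tokens: List[str], i: int, clusters: Dict[str, str]) -> Dict[str, str]:
    """Build a feature dict for tokens[i] using induced word clusters.

    `clusters` maps lowercased words to cluster ids (assumed to come from
    an unsupervised tagger or clustering step run beforehand).
    """
    def cluster_of(j: int) -> str:
        if j < 0:
            return "<s>"    # before the start of the sentence
        if j >= len(tokens):
            return "</s>"   # past the end of the sentence
        return clusters.get(tokens[j].lower(), "UNK")

    word = tokens[i]
    return {
        "word": word.lower(),
        "suffix3": word[-3:].lower(),
        "cluster": cluster_of(i),
        "prev_cluster": cluster_of(i - 1),
        "next_cluster": cluster_of(i + 1),
    }
```

The cluster ids stand in for the POS tags you do not have, so whatever downstream model you use sees them as ordinary categorical features.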

Also, see the other answers to this question.

answered Oct 07 '10 at 15:22

This paper is about directly solving the multilingual problem. I don't think there's easy software for that.

You can train the Stanford POS tagger, the Stanford parser, or MSTParser on this CoNLL data and get good enough results. For languages it does not cover, finding a standard treebank and training a parser/tagger on it should work as well.
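Before training a tagger on treebank data you usually have to convert it to whatever input the tagger expects. A common plain-text layout is one sentence per line, with each token written as word, separator, tag. Whether a given tagger wants exactly this layout, and which separator it uses, is an assumption here and should be checked against that tool's documentation. The `to_tagged_lines` helper below is only an illustration.

```python
from typing import Iterable, Iterator, List, Tuple


def to_tagged_lines(
    sentences: Iterable[List[Tuple[str, str]]], sep: str = "_"
) -> Iterator[str]:
    """Turn (word, tag) sentences into 'word_TAG word_TAG ...' lines.

    The separator is an assumption; pick the one your tagger is
    configured to expect. Words that already contain the separator are
    left untouched, which only works if the tagger splits on the last
    occurrence of it.
    """
    for sentence in sentences:
        yield " ".join(f"{word}{sep}{tag}" for word, tag in sentence)
```

For example, the sentence `[("De", "ART"), ("kat", "N")]` becomes the line `De_ART kat_N`.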

If you want research on learning many languages at once, a cool paper is "Dependency Grammar Induction via Bitext Projection Constraints", by Kuzman Ganchev, Jennifer Gillenwater and Ben Taskar.

All the recent "multilingual" type work is kind of cool from a research perspective, but does not really provide good results.

Currently, if you want to actually tag text in some language, your best bet would be to find a specific tagger adapted and trained for this language. TreeTagger performs well for many languages and has many parameter files available.

For parsing, my advice would be to get a treebank for the language of choice and train the Berkeley parser on it. It seems to work best out of the box for any language it is tested on.

answered Oct 10 '10 at 21:19

Is there a resource for treebanks? I am happy to post a page collecting them in one place.

You can also take a look at the paper "Multilingual Part-of-Speech Tagging: Two Unsupervised Approaches", which learns POS taggers for multiple languages at the same time. It shows that simultaneous learning in multiple languages is beneficial.

answered Oct 07 '10 at 23:23

It is only beneficial in a severely crippled unsupervised setting, where very little text is actually used.

