Does anyone know an unsupervised lemmatizer? If not, a stemmer would also be fine.

My main requirement is that you can just throw a text corpus at it (in, say, some inflected language like German or Czech), and it outputs the lemma for each corpus token.

The other requirement is that it is a downloadable tool.

Morfessor comes close, but it outputs a morphological segmentation rather than lemmas. As far as I know, the stem is not even marked. That makes it hard to use as a stemmer, although one could use some heuristics to pick one of the segments as the stem.

Thanks!

asked Mar 28 '11 at 01:57

Frank

2 Answers:

TreeTagger: http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/

Typically used for POS annotation, but it also outputs lemmas. If I recall correctly, it's not unsupervised, but parameter files for several languages (English, French, German) are available, so you can use it out of the box on new data. There's a Python wrapper as well.

Newer NLTK versions have Porter/Snowball stemmers for several languages. Czech... I am not sure, though.

answered Mar 28 '11 at 19:26

David the Dude
Morfessor has a version (Morfessor Categories-MAP) that produces structured output: each morph is classified as a prefix, stem, or suffix (or "none").

For example, the word "straightforwardness" is segmented into "straight/STEM" + "forward/STEM" + "ness/SUFFIX", and the Finnish word "oppositiokansanedustaja" into "oppositio/STEM" + "kansa/STEM" + "n/SUFFIX" + "edusta/STEM" + "ja/SUFFIX".

You could create a "poor man's stemmer" if you just picked the morphs tagged as stems for each word.
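
Here is a rough sketch of that idea in Python. It assumes each segmentation arrives as one string in the "morph/TAG + morph/TAG" notation used in the examples above; the real Morfessor output file may be laid out differently, so the parsing would need adapting. The function name `stems_from_segmentation` is made up for illustration and is not part of Morfessor.

```python
def stems_from_segmentation(segmentation: str) -> list[str]:
    """Return the morphs tagged STEM from a string like 'a/STEM + b/SUFFIX'.

    Assumption: morphs are separated by '+' and each morph carries its
    category after the last '/'. This mirrors the notation in the post,
    not necessarily Morfessor's actual file format.
    """
    stems = []
    for part in segmentation.split("+"):
        # rpartition splits on the last '/', so the tag is whatever follows it
        morph, _, tag = part.strip().rpartition("/")
        if tag == "STEM":
            stems.append(morph)
    return stems


english = "straight/STEM + forward/STEM + ness/SUFFIX"
finnish = "oppositio/STEM + kansa/STEM + n/SUFFIX + edusta/STEM + ja/SUFFIX"

print(stems_from_segmentation(english))  # ['straight', 'forward']
print(stems_from_segmentation(finnish))  # ['oppositio', 'kansa', 'edusta']
```

Note that compounds come back with several stems, so you would still have to decide whether to keep all of them, join them, or pick just one (say, the last) as "the" stem.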

answered Mar 29 '11 at 08:23

paraba
edited Mar 29 '11 at 08:41

Thanks, that comes close. I don't think that's a "poor man's stemmer"; it really is an actual stemmer. (It's even a "rich man's stemmer", because it gives you more than just the stems.)

(Mar 29 '11 at 14:45) Frank

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.