|
Does anyone know of an unsupervised lemmatizer? If not, a stemmer would also be fine. My main requirement is that I can just throw a text corpus at it (in an inflected language such as German or Czech) and it outputs the lemma for each corpus token. The other requirement is that it is a downloadable tool. Morfessor comes close, but it outputs a morphological segmentation rather than lemmas. As far as I know, the stem is not even marked, which makes it hard to use as a stemmer, although one could use heuristics to identify one of the segments as the stem. Thanks! |
|
TreeTagger: http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/ It is typically used for POS annotation, but it also outputs lemmas. If I recall correctly, it is not unsupervised, but parameter files for several languages (English, French, German) are available, so you can use it out of the box on new data. There is a Python wrapper as well. Newer NLTK versions include Porter/Snowball stemmers for several languages. I am not sure about Czech, though. |
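For the NLTK route, here is a minimal sketch of stemming a token list with the Snowball stemmer. It assumes NLTK is installed; the helper name `stem_tokens` and the sample tokens are made up for illustration. Note that this is a rule-based stemmer, not an unsupervised one, and `SnowballStemmer.languages` is the place to check whether a given language is covered by your installation.

```python
from nltk.stem.snowball import SnowballStemmer

# Languages supported by this NLTK installation's Snowball stemmers.
print(SnowballStemmer.languages)


def stem_tokens(tokens, language="german"):
    """Stem each token with NLTK's Snowball stemmer for `language`.

    `stem_tokens` is a hypothetical helper, not part of NLTK.
    """
    if language not in SnowballStemmer.languages:
        raise ValueError(f"no Snowball stemmer available for {language!r}")
    stemmer = SnowballStemmer(language)
    return [stemmer.stem(token) for token in tokens]


# Illustrative tokens only; in practice these would come from your corpus.
print(stem_tokens(["Häuser", "gegangen", "Kinder"]))
```

The explicit language check is there so that an unsupported language fails with a clear message before any stemming is attempted.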
|
Morfessor has a version (Morfessor Categories-MAP) that produces structured output: each morph is classified as a prefix, stem, or suffix (or "none"). For example, the word [...] You could create a "poor man's stemmer" by just picking the morphs tagged as stems for each word.
Thanks, that comes close. I don't think that's a "poor man's stemmer"; it really is an actual stemmer. (It's even a "rich man's stemmer", because it gives you more than just the stems.)
(Mar 29 '11 at 14:45)
Frank
|
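The stem-picking idea from the last answer can be sketched as follows. This is a rough sketch under assumptions, not Morfessor's own API: it assumes each output line has the form `<count> morph/TAG + morph/TAG + ...` with the tags `PRE`, `STM`, `SUF`, and `NON`, so check this against the actual output of your Morfessor version. The function names and the sample lines are made up for illustration.

```python
import re

# ASSUMED line format (verify against your Morfessor output):
#   <count> morph/TAG + morph/TAG + ...
# where TAG is PRE (prefix), STM (stem), SUF (suffix), or NON (none).


def parse_segmentation(line):
    """Return a list of (morph, tag) pairs from one segmentation line."""
    line = line.strip()
    # Drop an optional leading frequency count.
    line = re.sub(r"^\d+\s+", "", line)
    pairs = []
    for chunk in line.split(" + "):
        morph, sep, tag = chunk.rpartition("/")
        if not sep:
            # No tag present: treat the chunk as an uncategorized morph.
            morph, tag = chunk, "NON"
        pairs.append((morph, tag))
    return pairs


def stem_word(line):
    """Join the morphs tagged STM; fall back to the whole word if there are none."""
    pairs = parse_segmentation(line)
    stems = [morph for morph, tag in pairs if tag == "STM"]
    if stems:
        return "".join(stems)
    return "".join(morph for morph, _ in pairs)


# Invented sample lines in the assumed format.
sample = [
    "3 un/PRE + break/STM + able/SUF",
    "1 haus/STM + tür/STM + en/SUF",
    "2 und/NON",
]
for line in sample:
    print(stem_word(line))
```

Compounds with several stem morphs are joined into one string here; keeping only the last stem would be an equally defensible choice, depending on what the stems are used for.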