I'm applying the Porter stemmer (as implemented in Perl's Lingua::Stem package) and, naturally, it makes some errors. One that jumped out at me immediately was "several => sever." I understand that since language is highly ambiguous, the stemmer cannot possibly be perfect. Since the Porter stemmer seems to be the most popular stemmer for English, though, I was wondering if there was a standard list of exceptions that someone had developed beyond whatever is the default in Perl. A quick Google search hasn't turned up much, but any help would be much appreciated.

asked Jul 16 '10 at 10:51

Troy Raeder


6 Answers:

It will make several more errors, such as "offing => of." What I recommend is that you scan the most frequent tokens, which are likely to be closed-class function words, and make sure that nothing else maps to them as stems. I can't think of anything better and easier off the top of my head. This problem should actually be easy to solve with a statistical model: you would expect to see more regular morphological variation, as well as semantic consistency, among words that really share a stem. A good problem for someone to work on....
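A minimal sketch of that check, in Python rather than Perl. Everything named here is an assumption for illustration: `make_guarded_stemmer`, the tiny `FUNCTION_WORDS` set (in practice you would fill it from the most frequent tokens in your own corpus), and `fake_porter`, a lookup table that only imitates the two bad stems mentioned in this thread and stands in for a real Porter implementation.

```python
from typing import Callable, Iterable, Mapping, Optional


def make_guarded_stemmer(
    stem_fn: Callable[[str], str],
    protected_stems: Iterable[str],
    exceptions: Optional[Mapping[str, str]] = None,
) -> Callable[[str], str]:
    """Wrap stem_fn so that no other word is allowed to collapse onto a protected stem."""
    protected = frozenset(protected_stems)
    overrides = dict(exceptions or {})

    def guarded(word: str) -> str:
        # Hand-written exceptions take priority over the stemmer.
        if word in overrides:
            return overrides[word]
        stem = stem_fn(word)
        # A different word landing on a function word is treated as a stemming error.
        if stem != word and stem in protected:
            return word
        return stem

    return guarded


# Hypothetical stand-in for a real stemmer: it only knows the examples below.
_FAKE_STEMS = {"offing": "of", "several": "sever", "running": "run"}


def fake_porter(word: str) -> str:
    return _FAKE_STEMS.get(word, word)


# Hypothetical protected set; normally derived from a frequency scan of the corpus.
FUNCTION_WORDS = {"of", "the", "a", "in", "on"}

stem = make_guarded_stemmer(
    fake_porter,
    FUNCTION_WORDS,
    exceptions={"several": "several"},  # manual exception list
)

for w in ["offing", "several", "running", "of"]:
    print(w, "=>", stem(w))
```

The guard only catches collisions with the protected set, so a case like "several => sever" still needs an entry in the manual exception list.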

answered Jul 16 '10 at 11:17

aria42

I've seen a paper that did that. It actually learned an unsupervised classifier that decided whether or not two words share a stem, but with a bit of handholding, a dictionary, and a lot of patience I guess this could be made into a stemmer. I can't find the reference now, but there is some work on unsupervised stemming of Arabic, according to Google Scholar.

(Jul 16 '10 at 12:45) Alexandre Passos ♦

I've also gotten around this in a slightly different, but not entirely correct way. In my IR-related user interfaces, I need to have access to the stemmed version so I can search databases and do NLP, but on the surface, stems are ugly, so instead of using the stemmed version, I collect words into groups with the same stem, and then use the most frequent word from that group. You might get lucky and find that "several" is the only "sever"-stem word. This is a common technique for building tag clouds and other text displays, where you want to treat all the morphological versions of a word as the same thing.
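Roughly, the grouping looks like the sketch below. It is only an illustration: `display_forms` is a made-up name, and `fake_porter` is a lookup table standing in for whatever stemmer you actually use.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, Iterable, List


def display_forms(tokens: Iterable[str], stem_fn: Callable[[str], str]) -> Dict[str, str]:
    """Map each stem to the most frequent surface word that produced it."""
    counts = Counter(tokens)
    groups: Dict[str, List[str]] = defaultdict(list)
    for word in counts:
        groups[stem_fn(word)].append(word)
    # Highest count wins; ties are broken alphabetically so the result is stable.
    return {
        stem: min(words, key=lambda w: (-counts[w], w))
        for stem, words in groups.items()
    }


# Hypothetical stand-in for a real stemmer: it only knows these words.
_FAKE_STEMS = {
    "connected": "connect",
    "connecting": "connect",
    "connection": "connect",
    "several": "sever",
}


def fake_porter(word: str) -> str:
    return _FAKE_STEMS.get(word, word)


tokens = ["connected", "connection", "connection", "connecting", "several"]
labels = display_forms(tokens, fake_porter)
print(labels)
```

You still index and search on the stem (the dictionary key) and only show the value to the user. In this toy input "several" is the only word in the "sever" group, so it is what gets displayed.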

answered Jul 16 '10 at 12:06

aditi

Maybe a better idea is to use wordnet to canonicalize the words it can, and just stem/leave unchanged the words it can't.
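A sketch of that fallback, with both pieces passed in as functions. The names `canonicalize`, `toy_wordnet` and `fake_porter` are hypothetical, and the two small dictionaries are stand-ins for a real WordNet lookup and a real stemmer. The only assumption about the lookup is that it returns `None` for words it does not know.

```python
from typing import Callable, Optional


def canonicalize(
    word: str,
    lemma_lookup: Callable[[str], Optional[str]],
    stem_fn: Callable[[str], str],
) -> str:
    """Prefer the dictionary lemma; fall back to the stemmer for unknown words."""
    lemma = lemma_lookup(word)
    return lemma if lemma is not None else stem_fn(word)


# Hypothetical stand-in for a WordNet lookup.
_TOY_LEMMAS = {"several": "several", "corpora": "corpus", "trees": "tree"}


def toy_wordnet(word: str) -> Optional[str]:
    return _TOY_LEMMAS.get(word)


# Hypothetical stand-in for a stemmer, used only for out-of-dictionary words.
_FAKE_STEMS = {"blogging": "blog"}


def fake_porter(word: str) -> str:
    return _FAKE_STEMS.get(word, word)


for w in ["several", "corpora", "blogging"]:
    print(w, "=>", canonicalize(w, toy_wordnet, fake_porter))
```

Passing `stem_fn=lambda w: w` gives the "leave unchanged" variant for words the dictionary does not cover.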

answered Jul 16 '10 at 12:45

Alexandre Passos ♦

I was going to suggest you look at Yarowsky & Wicentowski (2000), but they solve a slightly different problem: lemmatization. The difference between stemming and lemmatization is that stemming gives you a form that is not necessarily a word on its own; it's just the stem (e.g. 'giving' => 'giv'). Lemmatization, by contrast, gives you the 'lexicon entry' for the word ('giving' => 'give').

Now on your example, 'sever' might actually be the correct stem of 'several' (while it certainly is not the lemma) because etymologically, 'several' comes from Latin 'separ', which means separate. So it might be this modified stem 'separ' with the adjective ending '-al' appended.

(Maybe you really want a lemmatizer instead of a stemmer?)

answered Jul 16 '10 at 13:44

Frank

edited Jul 16 '10 at 18:36

Good point. In all the NLP applications I've come across that use stemmers, a lemmatizer would be more appropriate.

(Jul 16 '10 at 15:19) aditi

@aditi: Yes, and lemmatization is easier to solve in an unsupervised statistical way, using large corpora, as Yarowsky & Wicentowski have done. The stem, on the other hand, may never occur in any corpus (that's especially true for languages other than English), so you cannot collect statistics for its frequency, context, etc.

(Jul 16 '10 at 18:42) Frank

The Porter stemming algorithm is easy to use and implement, but its rules are very simple. You can try other stemming algorithms (e.g., Lovins, Lancaster, ...). There are a few links to implementations in the External Links section of the Wikipedia page.

Or you can use a lemmatizer instead. Lemmatizers are more complex than stemmers, but they can usually achieve better results in corner cases like the one you described.

answered Jul 16 '10 at 13:49

Pedro Oliveira

Use Morpha instead

$ echo "corpora pineapples trees data several" | ./morpha.ix86_linux -u
corpus pineapple tree datum several

answered Jul 16 '10 at 15:09

Aditya Mukherji

edited Jul 16 '10 at 15:09


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.