|
I have a document corpus and I am trying to cluster the keywords associated with these documents to derive the major topics/genres. I am trying to find all related words for a given word. For instance, the different forms (ignoring parts of speech) of the word "baby-sitter" could be "baby-sit", "baby-sitting", "baby-sits", etc. I want all these words to map to a single base word (say "baby-sit") so that they are not counted as distinct words. Is there a tool available online? I have tried WordNet (online) but it does not directly give all the forms of a word. Any direction on using WordNet for this task, or information on an existing tool, would be helpful.
|
There are two different techniques for base-form reduction. Stemming is algorithmically simpler, but does not produce "human-readable" base forms; it basically just truncates suffixes. Lemmatization reduces to a human-readable base form, so mice -> mouse, etc. Porter's algorithm is pretty much the de facto standard for stemming English, although there are improvements (some due to Porter himself) which would be useful to include. If your target language is not English, you will need to look elsewhere. For lemmatizers, it's less straightforward to give a recommendation, but if you are familiar with Python and NLTK, that's one obvious starting point. One of the top Google hits is http://text-processing.com/demo/stem/ which has more discussion and pointers.