Hi, I would like to calculate the frequency of function words in Python/NLTK. I see two ways to go about it:

1. POS-tag the text and count the words whose tags correspond to function-word categories.
2. Look each word up in a predefined list of function words.

The catch in the first case is that my data is noisy and I don't know (for sure) which POS tags correspond to function words. The catch in the second case is that I don't have a list, and since my data is noisy the lookup won't be accurate. I would prefer the first approach to the second, or any other approach that would give me more accurate results.
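Roughly, something like the following (a sketch, not working code; in particular, the set of Penn Treebank tags treated as function-word tags below is a guess on my part):

    import nltk
    from collections import Counter

    # one-time setup: nltk.download('punkt'), nltk.download('averaged_perceptron_tagger')
    text = "This is a short example, and it has some function words in it."
    tokens = nltk.word_tokenize(text)

    # Approach 1: POS-tag and keep closed-class (function word) tags.
    # Which Penn Treebank tags count as "function words" is an assumption here.
    FUNCTION_TAGS = {"DT", "IN", "CC", "TO", "MD", "PRP", "PRP$",
                     "WDT", "WP", "WP$", "WRB", "EX", "PDT", "RP"}
    tagged = nltk.pos_tag(tokens)
    freq_by_tag = Counter(w.lower() for w, t in tagged if t in FUNCTION_TAGS)

    # Approach 2: look each token up in a predefined list
    # (a tiny stand-in list here, since I don't have a real one).
    function_words = {"a", "an", "the", "and", "it", "in", "this", "is"}
    freq_by_list = Counter(t.lower() for t in tokens if t.lower() in function_words)

    print(freq_by_tag.most_common())
    print(freq_by_list.most_common())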
Using a list should be easiest, as you have done. If you know the language, you can sort the words by frequency and then manually select the function words. For English, Wikipedia has a list of prepositions, and a bit of googling should turn up similar lists of pronouns, etc. There are also free, ready-made stop word lists online (of varying quality); I found a few with my first Google search. Here is one for English. It should be easy to check the list for any words you don't want to include, and you can check your corpus manually for words you wish to include but that aren't on the list yet.

Gdahl, is it safe to assume that stop words equal function words? NLTK has a default stop word list. I used LIWC for the list/lexicon lookup; I think that's safer.
(May 07 '11 at 03:00)
Dexter
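(For reference, NLTK's default English stop word list can be loaded as below; a minimal sketch, assuming the stopwords corpus has already been downloaded.)

    from nltk.corpus import stopwords

    # one-time setup: nltk.download('stopwords')
    stop_words = set(stopwords.words('english'))
    print(len(stop_words))           # size of the default English list
    print(sorted(stop_words)[:10])   # a peek at the first few entries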
I believe those two terms are usually used interchangeably. Wikipedia indicates that some function words are stop words, but not all: http://en.wikipedia.org/wiki/Stop_words
(May 07 '11 at 09:49)
Robert Layton
Robert, yes. Hence, I guess a predefined function word list is apt for my task.
(May 11 '11 at 10:57)
Dexter