I'm currently trying to classify Tweets using the Naive Bayes classifier in NLTK. I'm classifying tweets related to particular stock symbols, identified by the '$' prefix (e.g. $AAPL). I've been basing my Python script on this blog post: Twitter Sentiment Analysis using Python and NLTK. So far, I've been getting reasonably good results, but I feel there is still plenty of room for improvement.
For my word-feature selection method, I decided to implement the tf-idf algorithm to pick out the most informative words. After doing so, though, the results weren't that impressive.
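For concreteness, here is a minimal sketch of the kind of tf-idf scoring I mean, using a made-up toy tweet corpus (the tweets and scoring details are illustrative, not my real data or pipeline):

```python
import math
from collections import Counter

def tf_idf_scores(docs):
    """Score each word by summed tf-idf across a list of tokenized docs."""
    n_docs = len(docs)
    # document frequency: in how many tweets does each word appear?
    df = Counter(word for doc in docs for word in set(doc))
    scores = Counter()
    for doc in docs:
        tf = Counter(doc)
        for word, count in tf.items():
            # tf * idf, with idf = log(N / df)
            scores[word] += (count / len(doc)) * math.log(n_docs / df[word])
    return scores

tweets = [
    "the $AAPL earnings beat expectations".lower().split(),
    "the market is down and $AAPL is up".lower().split(),
    "buy the dip is the advice".lower().split(),
]
scores = tf_idf_scores(tweets)
# a word appearing in every tweet gets idf = log(1) = 0, so it scores zero
print(scores["the"])  # 0.0
```

The problem described below is that in tweets the idf term fails to separate function words from informative ones as cleanly as it does in longer documents.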
I then implemented the technique from the following blog post: Text Classification Sentiment Analysis Eliminate Low Information Features. The results were very similar to the ones obtained with the tf-idf algorithm, which led me to inspect my classifier's 'Most Informative Features' list more thoroughly. That's when I realized I had a bigger problem:
Tweets and ordinary text don't use the same grammar and wording. In normal text, many articles and verbs can be filtered out using tf-idf or stopwords. However, in a tweet corpus, extremely uninformative words such as 'the', 'and', and 'is' occur just as often as words that are crucial to categorizing the text correctly. I can't simply remove all words with fewer than three letters, because some uninformative features are longer than that and some informative ones are shorter.
If I could, I would rather not use stopwords at all, because of the need to keep the list frequently updated. However, if that's my only option, I guess I'll have to go with it.
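One workaround I've considered, instead of a hand-maintained stopword list, is deriving a stopword-like set from the corpus itself by dropping words that occur in more than some fraction of tweets. A rough sketch (the 0.5 cutoff and the toy tweets are arbitrary examples):

```python
from collections import Counter

def corpus_stopwords(docs, max_df_ratio=0.5):
    """Treat any word appearing in more than max_df_ratio of the
    documents as a corpus-derived stopword."""
    n_docs = len(docs)
    df = Counter(word for doc in docs for word in set(doc))
    return {w for w, c in df.items() if c / n_docs > max_df_ratio}

tweets = [
    ["the", "$aapl", "earnings", "beat"],
    ["the", "market", "is", "down"],
    ["the", "dip", "is", "here"],
]
# 'the' (3/3 tweets) and 'is' (2/3 tweets) exceed the 0.5 cutoff
print(corpus_stopwords(tweets))
```

The list then updates itself as the corpus grows, at the cost of tuning the cutoff.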
So, to summarize my question: does anyone know how to truly get the most informative words from a source as specific as Tweets?
EDIT: I was wondering: for TF-IDF, should I cut off only the words with low scores, or also some of those with high scores? In each case, what percentage of the source text's vocabulary would you exclude from the feature selection process?
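To make the edit concrete, this is the kind of two-sided trimming I have in mind: drop the bottom and top fractions of the tf-idf ranking and keep the middle. The percentages and the example scores are placeholders I'd want advice on, not values I've validated:

```python
def trim_by_score(scored_words, low_pct=0.1, high_pct=0.05):
    """Keep the middle of a tf-idf ranking, dropping the lowest-
    and highest-scoring fractions of the vocabulary."""
    ranked = sorted(scored_words, key=scored_words.get)
    n = len(ranked)
    lo = int(n * low_pct)
    hi = n - int(n * high_pct)
    return set(ranked[lo:hi])

# made-up scores just to show the mechanics
scores = {"the": 0.0, "is": 0.1, "market": 1.2, "earnings": 2.5, "$aapl": 3.0}
kept = trim_by_score(scores, low_pct=0.2, high_pct=0.2)
# drops the lowest-scored 'the' and the highest-scored '$aapl'
```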
This is my stop word list:
answered Feb 16 '12 at 15:35
First off, interesting project.
You probably know that TF-IDF scores words by their frequency within a document relative to how common they are across the corpus. My hunch is that you are right that TF-IDF will not perform well for this problem.
You may be able to create a really useful REGEX for this situation. For instance, stock tickers are usually capitalized, aren't they? If so, you could combine a POS tagger with a filter that drops all non-useful words fewer than 4-5 letters in length, using the REGEX to exclude stock tickers from that filter. This seems like it may be easy to try.
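Something along these lines, as a minimal sketch. The ticker pattern (`$` plus 1-5 letters) and the 4-letter cutoff are assumptions you'd want to check against your data, and a POS tagger (e.g. NLTK's `pos_tag`) could then be layered on top of what survives:

```python
import re

# assumed ticker pattern: '$' followed by 1-5 letters, e.g. $AAPL
TICKER = re.compile(r"^\$[A-Za-z]{1,5}$")

def filter_tokens(tokens, min_len=4):
    """Drop short tokens, but always keep stock tickers so the
    length cutoff never discards a symbol like $AAPL or $F."""
    return [t for t in tokens
            if TICKER.match(t) or len(t) >= min_len]

print(filter_tokens(["$AAPL", "is", "up", "and", "$F", "beats", "earnings"]))
# ['$AAPL', '$F', 'beats', 'earnings']
```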
If this doesn't get you the results you are looking for, there are still plenty of other techniques to try, but they are more complicated and would require a different approach than the one used in the blog post you mentioned.
answered Feb 17 '12 at 09:34