I'm currently trying to classify tweets using the Naive Bayes classifier in NLTK. I'm classifying tweets related to particular stock symbols, using the '$' prefix (e.g. $AAPL). I've been basing my Python script off of this blog post: Twitter Sentiment Analysis using Python and NLTK. So far, I've been getting reasonably good results. However, I feel there is much room for improvement.

For my word-feature selection method, I decided to implement the TF-IDF algorithm to select the most informative words. After doing this, though, I felt the results weren't that impressive.
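As a rough sketch of the TF-IDF scoring described here (pure Python, assuming the tweets are already tokenized into word lists; the sample tweets are made up):

```python
import math
from collections import Counter

def tf_idf_scores(documents):
    """Score each word in each document by TF-IDF.

    documents: list of token lists (one per tweet).
    Returns a list of {word: score} dicts, one per document.
    """
    n_docs = len(documents)
    # Document frequency: in how many tweets does each word appear?
    df = Counter()
    for doc in documents:
        df.update(set(doc))

    scores = []
    for doc in documents:
        tf = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in tf.items()
        })
    return scores

docs = [["$AAPL", "is", "soaring", "today"],
        ["$AAPL", "is", "tanking", "hard"],
        ["great", "earnings", "for", "$AAPL"]]
scores = tf_idf_scores(docs)
# "$AAPL" appears in every tweet, so its IDF (and hence its score) is zero,
# while "soaring" appears in only one tweet and gets a positive score.
```

Note that the ticker itself scores zero here, which is part of the problem described below: TF-IDF only looks at frequency, not at how a word relates to the sentiment classes.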

I then implemented the technique from the following blog post: Text Classification Sentiment Analysis Eliminate Low Information Features. The results were very similar to the ones obtained with the TF-IDF algorithm, which led me to inspect my classifier's 'Most Informative Features' list more thoroughly. That's when I realized I had a bigger problem:

Tweets and regular text don't use the same grammar and wording. In normal text, many articles and verbs can be filtered out using TF-IDF or stopwords. However, in a tweet corpus, some extremely uninformative words, such as 'the', 'and', 'is', etc., occur just as often as words that are crucial to categorizing the text correctly. I can't just remove all words shorter than 3 letters, because some uninformative features are longer than that, and some informative ones are shorter.
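One way around pure frequency is to score words by how unevenly they are distributed across the classes, which is the idea behind the low-information-features post mentioned above (that post uses NLTK's chi-squared metric; here is a self-contained pure-Python sketch of the same statistic, with invented example tweets):

```python
from collections import Counter

def chi_sq(n_ii, n_io, n_oi, n_oo):
    """Chi-squared statistic for a 2x2 contingency table:
    n_ii = occurrences of the word in positive tweets,
    n_io = occurrences in negative tweets,
    n_oi / n_oo = occurrences of all other words in positive / negative."""
    n = n_ii + n_io + n_oi + n_oo
    num = n * (n_ii * n_oo - n_io * n_oi) ** 2
    den = (n_ii + n_io) * (n_oi + n_oo) * (n_ii + n_oi) * (n_io + n_oo)
    return num / den if den else 0.0

def word_scores(pos_tweets, neg_tweets):
    pos = Counter(w for t in pos_tweets for w in t)
    neg = Counter(w for t in neg_tweets for w in t)
    pos_total, neg_total = sum(pos.values()), sum(neg.values())
    return {
        w: chi_sq(pos[w], neg[w], pos_total - pos[w], neg_total - neg[w])
        for w in set(pos) | set(neg)
    }

pos = [["the", "stock", "soared"], ["the", "earnings", "beat"]]
neg = [["the", "stock", "tanked"], ["the", "guidance", "missed"]]
scores = word_scores(pos, neg)
# "the" is equally frequent in both classes, so it scores exactly 0;
# "soared" occurs only in positive tweets, so it scores higher.
```

Unlike TF-IDF, this scores 'the' at zero even though it is frequent, because its frequency is the same in both classes, which is exactly the property the tweet corpus breaks for frequency-only methods.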

If possible, I would prefer not to use a stopword list, because of the need to update it frequently. However, if that's my only option, I guess I'll have to go with it.

So, to summarize my question: does anyone know how to truly get the most informative words from a source as specific as tweets?

EDIT: I was wondering, for TF-IDF, should I only be cutting off the words with low scores, or also some with higher scores? In each case, what percentage of the vocabulary of the text source would you exclude from the feature selection process?
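Trimming both tails of the score distribution (very low scores for stopword-like terms, very high scores for rare typos and one-off tokens) could be sketched like this; the cutoff percentages and the sample scores below are placeholders, not recommendations:

```python
def trim_vocabulary(word_scores, low_pct=0.25, high_pct=0.125):
    """Keep words whose score falls between the bottom `low_pct` and the
    top `high_pct` of the ranked vocabulary. The default percentages
    are purely illustrative."""
    ranked = sorted(word_scores, key=word_scores.get)
    low_cut = int(len(ranked) * low_pct)
    high_cut = int(len(ranked) * (1 - high_pct))
    return set(ranked[low_cut:high_cut])

# Invented aggregate TF-IDF scores for a tiny vocabulary:
scores = {"the": 0.01, "and": 0.012, "a": 0.015, "is": 0.02,
          "stock": 0.5, "earnings": 0.6, "soared": 0.9,
          "zzz_rare_typo": 3.0}
kept = trim_vocabulary(scores)
# drops "the" and "and" from the low tail and "zzz_rare_typo"
# from the high tail, keeping the mid-range content words.
```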

asked Jan 07 '12 at 18:18


Elliott Bolzan

edited Jan 09 '12 at 20:21


I'm not sure a stopword list would need frequent updating - stopwords are often closed-class words (determiners, prepositions, pronouns, ...), which don't change that often.

(Jan 08 '12 at 10:08) eowl

OK, I understand. But in that case, I'd have to track a specific set of stock symbols, because otherwise the company names will start appearing in the most informative features, and I'll have to add them to the stoplist to prevent bias, right?

(Jan 08 '12 at 10:29) Elliott Bolzan

Obviously, the company names will be highly informative features. I don't see how that's a problem; you can't generalize to classes (companies) you've never seen in training anyway.

(Jan 09 '12 at 09:42) larsmans

Well, if a company has a higher proportion of negative tweets in training than in practice, the results will be biased, and the company's overall score will be massively downgraded when the classifier is actually used in practice.

(Jan 09 '12 at 18:26) Elliott Bolzan

2 Answers:

This is my stop word list:


answered Feb 16 '12 at 15:35


Vishal Goklani

First off, interesting project.

You probably know that TF-IDF will cut words based on their relative local frequency. My hunch is that you are right: TF-IDF will not perform well for this problem.

You may be able to create a really useful regex for this situation. For instance, are stock tickers usually capitalized? If so, you could combine a POS tagger with a length filter to drop all non-useful words shorter than 4-5 letters (using the regex to exclude stock tickers from this filter). This seems like it may be easy to try.
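The regex-plus-length-filter idea might look something like this; the ticker pattern (a '$' followed by 1-5 capital letters) and the 4-letter cutoff are assumptions, not an official ticker grammar:

```python
import re

# Treat "$" followed by 1-5 capital letters as a stock ticker (assumed pattern).
TICKER = re.compile(r"^\$[A-Z]{1,5}$")

def filter_tokens(tokens, min_len=4):
    """Drop short tokens unless they look like a ticker symbol."""
    return [t for t in tokens if TICKER.match(t) or len(t) >= min_len]

tokens = ["$AAPL", "is", "up", "and", "the", "earnings", "beat", "hopes"]
filtered = filter_tokens(tokens)
# keeps "$AAPL" despite its length; drops "is", "up", "and", "the"
```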

If this doesn't get you the results you are looking for, there are still a lot of other techniques to try, but they are more complicated and would require a different approach than is used within that blog post you mentioned.

Good luck,

answered Feb 17 '12 at 09:34


Ryan Kirk

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.