I am trying to build a Twitter movie sentiment analyzer using a naive Bayes classifier. It should report percentage ratings for a movie, for example: Mission Impossible III is 70% positive and 30% negative, based on tweets. I just want to create something really basic and use regular naive Bayes. I'm not trying to compete with fflick or the other big startups that have already done this. This is just for my learning experience.

The problem is that I need to train a model before running the classifier on tweets about an unknown movie, so I need one set of tweets that are positive about a movie and another set that are negative. Where can I get this data?

To make it better I would also need to apply some filter to the tweets. What is a good way to filter them? So far I've read this paper to get the idea: http://www-nlp.stanford.edu/courses/cs224n/2009/fp/3.pdf

asked Mar 25 '11 at 15:18

Alex Hernandez

edited Mar 25 '11 at 15:19

Do you want a strategy to filter tweets, or help on how to technically get the tweets? I don't know anything about Twitter's API, but I think that your first cut on filtering tweets should be to find tweets that explicitly mention a movie. Once you have tweets that mention movies, you can do sentiment analysis on them.

(Mar 25 '11 at 17:38) Travis Wolfe

I don't think there are any explicit datasets of this out there (or at least I haven't heard of any). You may have to find movie titles in tweets yourself, somehow. If you don't mind discussion of old movies, go here: http://snap.stanford.edu/data/twitter7.html. The dataset is no longer listed as available, but the links still work; they are commented out in the page source.

(Mar 26 '11 at 20:32) karpathy

I know how to pull out the tweets. My question is about how to build the classifier when I don't have a model yet. Even though I can pull out the tweets about movies, I don't know whether a given tweet is positive or negative... in other words, it's unsupervised learning. I need this information if I want to use naive Bayes.

(Mar 27 '11 at 12:20) Alex Hernandez

One Answer:

First, keep in mind that probability estimates from naive Bayes are very poorly calibrated and tend to cluster around 0 and 1, so it's a good idea to use a better-calibrated classifier, such as logistic regression, if you want to report credible averages. That said, you still need to build a classifier. I would suggest that you create a dataset by following Paul Mineiro's suggestion on his blog: search for a movie, and assume that any tweet containing ":)" is positive and any tweet containing ":(" is negative. I'd also ignore the words in the name of the film, to avoid needlessly biasing the classifier towards assigning high "goodness" to the names of good movies and "badness" to the names of bad movies.
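
A minimal sketch of that labeling step, in plain Python. The function names `label_tweet` and `features`, the tokenization regex, and the choice to discard tweets with mixed or missing emoticons are illustrative assumptions, not something taken from the blog post mentioned above.

```python
import re

# Emoticon sets are an assumption for illustration; extend as needed.
POSITIVE = (":)", ":-)")
NEGATIVE = (":(", ":-(", ":-/")

def label_tweet(text):
    """Return 'pos', 'neg', or None if emoticons are absent or conflicting."""
    has_pos = any(e in text for e in POSITIVE)
    has_neg = any(e in text for e in NEGATIVE)
    if has_pos == has_neg:  # neither kind, or both kinds: no usable label
        return None
    return "pos" if has_pos else "neg"

def features(text, movie_title):
    """Lowercased word tokens with the words of the movie title removed.

    The regex keeps only letters, digits and apostrophes, so the emoticons
    used for labeling never end up among the features.
    """
    title_words = set(re.findall(r"[a-z0-9']+", movie_title.lower()))
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens if t not in title_words]
```

For example, `label_tweet("Loved Inception :)")` gives `'pos'`, and `features("Loved Inception :)", "Inception")` keeps only `['loved']`. The labels are noisy by construction, so this is a way to bootstrap training data rather than a substitute for hand-labeled tweets.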

answered Mar 25 '11 at 20:12

Alexandre Passos ♦

You're saying that I need to build my own model, starting with class labels where tweets with :) are positive and tweets with :( are negative?

(Mar 27 '11 at 12:23) Alex Hernandez

As a first step, to avoid having to label large amounts of data by hand, yes. I suggest you assume that tweets with the name of the movie and :-) are positive, and tweets with the name of the movie and :-( or :-/ are negative.
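
To make the connection to the original question concrete, here is a rough sketch of a multinomial naive Bayes with add-one (Laplace) smoothing trained on such noisy labels, plus a percentage-positive summary for a movie. Everything here is a hypothetical illustration: the names `train_nb`, `classify` and `percent_positive` are made up, and counting hard decisions instead of averaging probabilities is just one simple way to sidestep naive Bayes' poorly calibrated probability estimates.

```python
import math
from collections import Counter

def train_nb(labeled_docs):
    """labeled_docs: iterable of (tokens, label) pairs."""
    doc_counts = Counter()   # number of tweets per label
    word_counts = {}         # label -> Counter of token frequencies
    vocab = set()
    for tokens, label in labeled_docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return {"docs": doc_counts, "words": word_counts, "vocab": vocab}

def classify(model, tokens):
    """Return the label with the highest log posterior."""
    total_docs = sum(model["docs"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n_docs in model["docs"].items():
        counts = model["words"][label]
        n_words = sum(counts.values())
        score = math.log(n_docs / total_docs)  # log prior
        for t in tokens:
            if t in model["vocab"]:  # skip words never seen in training
                score += math.log((counts[t] + 1) / (n_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def percent_positive(model, tweets):
    """Share of tweets (token lists) classified as 'pos', in percent."""
    if not tweets:
        raise ValueError("need at least one tweet")
    labels = [classify(model, t) for t in tweets]
    return 100.0 * labels.count("pos") / len(labels)
```

The training pairs would come from the emoticon heuristic, with the emoticons and movie-title words stripped from the tokens before training.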

(Mar 27 '11 at 12:24) Alexandre Passos ♦

What would be the next, more advanced step after that?

(Mar 27 '11 at 13:15) Alex Hernandez

The next step after that would be to find words that are generally positive or negative. The step after that would be to automatically score tweets based on word combinations, e.g. "not bad" should not get a negative score, but it isn't exactly positive either.
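
A toy sketch of that idea, assuming tiny hand-made word lists and a made-up rule that a negator flips and halves the polarity of the word immediately after it. The word lists, the `score_tweet` name and the 0.5 damping factor are all arbitrary illustrations.

```python
# Tiny illustrative lexicons; a real one would be much larger.
POSITIVE_WORDS = {"good", "great", "love", "fun"}
NEGATIVE_WORDS = {"bad", "boring", "awful", "hate"}
NEGATORS = {"not", "never", "no"}

def score_tweet(tokens):
    """Sum word polarities; a negator flips and damps the next word only."""
    score = 0.0
    negate = False
    for t in tokens:
        if t in NEGATORS:
            negate = True
            continue
        # +1 for a positive word, -1 for a negative word, 0 otherwise
        polarity = (t in POSITIVE_WORDS) - (t in NEGATIVE_WORDS)
        if negate and polarity:
            polarity = -0.5 * polarity  # arbitrary damping factor
        score += polarity
        negate = False  # negation applies to the immediately following word
    return score
```

With these assumptions, `["not", "bad"]` scores mildly above zero while `["bad"]` scores clearly negative, which roughly matches the intuition that "not bad" is weaker than plain praise.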

(Mar 28 '11 at 03:04) Robert Layton

How would you tweak a regular naive Bayes to get a better result? Or is it better to use an SVM directly? I heard that SVMs are better than naive Bayes for movie classification.

(Mar 29 '11 at 19:11) Alex Hernandez

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.