
Hi,

I want to classify the sentiment of short sentences, for example "India m/", "Sachin <3", "Awesummmm", "you rock". I tried out a few things:

  1. For starters, I tried a model trained on movie_reviews (the corpus that ships with nltk), but it could not classify even short sentences like "i love you" or "i hate you".
  2. I manually collected 50 positive and negative comments from Facebook and trained NaiveBayes and LBFGSB models on them. They performed decently but were not robust.
  3. I have a Twitter dump. I took a predefined list of positive and negative words, picked the sentences containing any of them, and trained on those. I tried with 1000 positive and negative sentences, and with 5000 positive and negative. Both were again not robust (they fail on fairly obvious ones, e.g. "i like you").

I have two questions. Which algorithm should I go for? I have tried NaiveBayes and LBFGSB so far. Is there a combination that would work better? Should I use two or three classifiers and take a majority vote? The aim is to classify short, noisy sentences. In that case, what kind of data should ideally be used? I thought that training only on Facebook comments, and manually collected ones at that, would not be robust.

Thanks..

asked May 01 '11 at 09:46

crazyaboutliv


3 Answers:

Here is my advice (untested, I won't guarantee the results): collect a much larger corpus of tweets and use smileys as the positive / negative signal. You should be able to use the search API to find smiley-bearing tweets, e.g. https://twitter.com/#!/search/%3A%29 https://twitter.com/#!/search/%3A%28 but because of rate limiting it might be better to use the stream API with post-filtering instead. My intuition is that you will need a very, very large training set to get anything useful.
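A rough sketch of the smiley-labelling step, assuming the tweets are already available as plain strings. The emoticon lists and the function name `label_by_emoticon` are illustrative choices, not part of any Twitter API. The emoticons are stripped from the text so that a classifier trained on the result cannot simply memorise the label.

```python
# Assumed emoticon lists; extend them for your own data.
POSITIVE_EMOTICONS = (":)", ":-)")
NEGATIVE_EMOTICONS = (":(", ":-(")


def label_by_emoticon(tweet):
    """Return (label, text_without_emoticons), or None when the tweet
    has no emoticon or mixes positive and negative ones."""
    has_pos = any(e in tweet for e in POSITIVE_EMOTICONS)
    has_neg = any(e in tweet for e in NEGATIVE_EMOTICONS)
    if has_pos == has_neg:  # neither kind, or both kinds: skip it
        return None
    text = tweet
    for emoticon in POSITIVE_EMOTICONS + NEGATIVE_EMOTICONS:
        text = text.replace(emoticon, " ")
    text = " ".join(text.split())  # squeeze the leftover whitespace
    return ("pos" if has_pos else "neg", text)


if __name__ == "__main__":
    for t in ["you rock :)", "so tired :-(", "hello", "good :) bad :("]:
        print(repr(t), "->", label_by_emoticon(t))
```

Tweets with mixed or missing emoticons are dropped rather than guessed at, which costs data but keeps the noisy labels a bit cleaner.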

Then use a logistic regression model with l2 or l1 + l2 regularization. If you are familiar with nltk, install the megam program and use nltk-trainer to fit such a model. Try experimenting with character n-grams of size 2 to 4 in addition to whitespace-tokenized n-grams. Character n-grams might be able to pick up features common to words with varying spellings, which are quite common in twitter / facebook data.
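To make the feature idea concrete, here is a minimal pure-Python sketch. It is not megam or nltk-trainer: it swaps in a tiny hand-written SGD trainer for l2-regularized logistic regression. The function names, the `w=` / `c=` feature prefixes, and the hyperparameters are all assumptions made for illustration.

```python
import math
from collections import Counter


def features(text, n_min=2, n_max=4):
    """Count whitespace tokens plus character n-grams of size n_min..n_max."""
    tokens = text.lower().split()
    feats = Counter("w=" + tok for tok in tokens)
    padded = " " + " ".join(tokens) + " "  # mark word boundaries
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            feats["c=" + padded[i:i + n]] += 1
    return feats


def predict_proba(weights, text):
    """Probability of the positive class under the current weights."""
    z = sum(weights.get(f, 0.0) * v for f, v in features(text).items())
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(examples, epochs=100, lr=0.05, l2=0.001):
    """Plain SGD on the logistic loss. The l2 penalty is applied only to
    the features active in each example (a simplification).
    examples: list of (text, label) with label 1 (positive) or 0 (negative)."""
    weights = {}
    for _ in range(epochs):
        for text, y in examples:
            p = predict_proba(weights, text)
            for f, v in features(text).items():
                w = weights.get(f, 0.0)
                weights[f] = w + lr * ((y - p) * v - l2 * w)
    return weights


if __name__ == "__main__":
    toy = [("i love you", 1), ("love this <3", 1), ("awesome, you rock", 1),
           ("i hate you", 0), ("hate this so much", 0), ("this sucks", 0)]
    w = train_logreg(toy)
    for s in ["i love you", "i hate you", "loooove it"]:
        print(s, round(predict_proba(w, s), 3))
```

The toy data only shows the mechanics. On a real corpus a hashed or sparse feature representation and a dedicated solver would be the more practical choice.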

answered May 01 '11 at 10:41

ogrisel


I am so glad you replied to a question I asked :). Great slides you shared a few weeks back. I have a 500 million tweet corpus. I did use smileys to filter the tweets, along with some obvious words. But the +ve and -ve files had about a million lines each, and nltk-trainer (using Naive Bayes) hung my system. I did not try the regression method you mentioned. Will try it today and report back :). Thanks.

(May 01 '11 at 10:45) crazyaboutliv

megam should work fine. Otherwise you can try the SGDClassifier of scikit-learn; see the examples folder, there are a couple of examples that deal with document classification.

(May 01 '11 at 10:56) ogrisel

Are you trying to do positive/negative/neutral classification or just positive/negative? I actually just completed a project for positive/negative/neutral sentiment classification of tweets with naive bayes classifiers trained on :-) and :-(.

This paper is one of the best resources I found for the positive/negative case, which I found to be a pretty straightforward problem, at least in the Twitter domain.

If you take the naive bayes with emoticons approach, I would recommend training on at least 500,000-1,000,000 examples of each class. I was also able to improve results by ignoring the prior, p(C), as emoticons don't really provide a good estimate of it.
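Ignoring the prior can be sketched with a small multinomial naive bayes in plain Python. This is an illustrative reconstruction, not the code from the project described above: the function names, the `use_prior` switch, and the add-one smoothing are assumptions.

```python
import math
from collections import Counter


def train_nb(examples):
    """examples: iterable of (tokens, label).
    Returns per-class token counts and per-class document counts."""
    word_counts, doc_counts = {}, Counter()
    for tokens, label in examples:
        word_counts.setdefault(label, Counter()).update(tokens)
        doc_counts[label] += 1
    return word_counts, doc_counts


def classify_nb(tokens, word_counts, doc_counts, use_prior=False, alpha=1.0):
    """Pick the class with the highest log score. With use_prior=False the
    log p(C) term is dropped, so only the token likelihoods decide."""
    vocab = set().union(*word_counts.values())
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        score = math.log(doc_counts[label] / total_docs) if use_prior else 0.0
        denom = sum(counts.values()) + alpha * len(vocab)
        for tok in tokens:
            if tok in vocab:  # skip tokens never seen in training
                score += math.log((counts[tok] + alpha) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Deliberately imbalanced toy data: one positive, three negative documents.
    data = [(["good", "ok"], "pos"),
            (["bad", "ok"], "neg"), (["bad"], "neg"), (["bad"], "neg")]
    wc, dc = train_nb(data)
    print(classify_nb(["ok"], wc, dc))                  # prior ignored
    print(classify_nb(["ok"], wc, dc, use_prior=True))  # prior included
```

In the toy data the class ratio reflects how many documents were collected, not how often each sentiment occurs, which is the same situation as emoticon-harvested tweets. Dropping the prior stops that ratio from deciding borderline cases.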

answered May 02 '11 at 08:25

alto

For starters, we are doing just positive/negative, but positive/negative/neutral would actually be better. How much data did you use? Any other details would be great. Let me go through that paper. Sadly, naive bayes did not give good results for us. I should try ignoring the prior like you did and see.

(May 02 '11 at 08:47) crazyaboutliv

An algorithm such as NaiveBayes, by itself, will not solve your problem. Neither will using SVM, k-means, Artificial Neural Networks, or linear regression.

You need to focus on the features you select, and find features that correlate with sentiment before you worry about modelling those features. While an algorithm will help you find mixtures of features that carry strong sentiment, I doubt that there would be much interplay between many features. If I had to do this, I would guess that you are looking for features that correlate with sentiment individually, and therefore need an expert system for rule inference. Something that will return:

'<3' implies good
soundslike("awe-sum") implies good
'not' negates meaning (not sure of any rule system that allows this - perhaps a research project for another day...)
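A hand-written version of such rules might look like the sketch below. It is not a rule-inference system, since the rules here are typed in rather than learned. The lexicons, the `SOUNDS_LIKE` table, and the "a negator flips the next sentiment word" heuristic are all assumptions made for illustration.

```python
import re

# Tiny hypothetical lexicons; a real system would need far larger ones.
POSITIVE = {"<3", "awesome", "love", "like", "rock"}
NEGATIVE = {"hate", "sucks"}
SOUNDS_LIKE = {"awesum": "awesome", "luv": "love"}  # crude phonetic variants
NEGATORS = {"not", "never", "don't"}


def normalize(token):
    """Lowercase, collapse stretched letters, map known spelling variants."""
    token = token.lower()
    if token in POSITIVE or token in NEGATIVE or token in NEGATORS:
        return token
    collapsed = re.sub(r"(.)\1+", r"\1", token)  # "awesummmm" -> "awesum"
    return SOUNDS_LIKE.get(collapsed, collapsed)


def rule_score(text):
    """Sum +1 / -1 per sentiment word; a negator flips the next one found."""
    score, negate = 0, False
    for raw in text.split():
        tok = normalize(raw.strip(".,!?"))
        if tok in NEGATORS:
            negate = True
            continue
        polarity = (tok in POSITIVE) - (tok in NEGATIVE)
        if polarity:
            score += -polarity if negate else polarity
            negate = False
    return score


if __name__ == "__main__":
    for s in ["Sachin <3", "Awesummmm", "not awesome", "i do not hate you"]:
        print(s, "->", rule_score(s))
```

The negation handling is deliberately crude: the flip stays armed until the next sentiment word, however far away it is, so long sentences will trip it up.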

answered May 02 '11 at 09:27

Robert Layton

Thanks Robert. Yeah, you are right. We initially thought of having a few rules, mainly for removing unnecessary words and n-grams, but we could not work out how to go about it, and the rules multiplied rapidly. So we thought that if an out-of-the-box algorithm could give us a decent system, we could stick with it for now before thinking further.

(May 02 '11 at 09:37) crazyaboutliv

You can use rule building systems for that. Sorry, I'm no expert, so I can't tell you which ones to use.

(May 02 '11 at 21:47) Robert Layton

I found a few yesterday after seeing your answer , thanks :)

(May 03 '11 at 01:19) crazyaboutliv


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.