|
Hi, I want to classify the sentiment of short sentences, e.g. "India m/", "Sachin <3", "Awesummmm", "you rock". I tried out a few things :
I have two doubts: Which algorithm should I go for? I have tried NaiveBayes and LBFGSB so far. Is there a combination which will work better? Should I use two or three classifiers and then take a majority vote? The aim is to classify short, noisy sentences. In that case, what kind of data should ideally be used for training? I thought that training only on Facebook comments, and manually collected ones at that, would not be robust. Thanks.
|
Here is my advice (untested, I won't guarantee the results): collect a much larger corpus of tweets and use smileys as the positive / negative signal. You should be able to use the search API to find smiley-bearing tweets, e.g. https://twitter.com/#!/search/%3A%29 https://twitter.com/#!/search/%3A%28 but because of rate limiting it might be better to use the stream API with post-filtering instead. My intuition is that you will need a very, very large training set to get anything useful. Then use a logistic regression model with L2 or L1 + L2 regularization. If you are familiar with nltk, install the megam program and use nltk-trainer to fit such a model. Try to experiment with character n-grams of size 2 to 4 in addition to whitespace-tokenized n-grams. Character n-grams might be able to pick up features shared by words with varying spellings, which are quite common in twitter / facebook data.
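As a rough illustration of the two ideas above (smileys as a noisy label, and character n-grams next to word n-grams), here is a minimal standard-library sketch. The emoticon lists and the function names `label_by_emoticon`, `char_ngrams` and `word_ngrams` are assumptions made up for this example, not part of nltk or any other library.

```python
# Assumed emoticon lists; extend them for your own data.
POSITIVE = (":)", ":-)", ":D")
NEGATIVE = (":(", ":-(")


def label_by_emoticon(text):
    """Return (label, text_without_emoticons).

    The label is None when the text has no emoticon or has both kinds,
    since the signal is then missing or contradictory.
    """
    has_pos = any(e in text for e in POSITIVE)
    has_neg = any(e in text for e in NEGATIVE)
    if has_pos == has_neg:
        return None, text
    cleaned = text
    # Strip the emoticons so the classifier cannot simply memorize them.
    for e in POSITIVE + NEGATIVE:
        cleaned = cleaned.replace(e, " ")
    cleaned = " ".join(cleaned.split())
    return ("pos" if has_pos else "neg"), cleaned


def char_ngrams(text, n_min=2, n_max=4):
    """All lowercase character n-grams with n_min <= n <= n_max."""
    text = text.lower()
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def word_ngrams(text, n_max=2):
    """Whitespace-tokenized word n-grams up to length n_max."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]


# Differently spelled words still share character n-grams.
shared = set(char_ngrams("awesome")) & set(char_ngrams("Awesummmm"))
print(sorted(shared))
```

The features from `char_ngrams` and `word_ngrams` can simply be concatenated into one bag of features per message before training.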
I am so glad you replied to a question I asked :) . Great slides you shared a few weeks back. I have a 500 million tweet corpus. I did use smileys to filter the tweets, along with some obvious words. But the +ve and -ve files had about a million lines each, and nltk-trainer (I used Naive Bayes) hung my system. I did not try the regression method you mentioned. Will try it today and report back :) . Thanks.
(May 01 '11 at 10:45)
crazyaboutliv
megam should be able to work fine. Otherwise you can try the SGDClassifier of scikit-learn; see the examples folder, there are a couple of examples that deal with document classification.
(May 01 '11 at 10:56)
ogrisel
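To give a feel for why an SGD-trained linear model copes with millions of lines where a batch trainer may not, here is a rough pure-Python sketch of the underlying idea: logistic regression with an L2 penalty, updated one example at a time, on hashed character n-grams. This is not scikit-learn's implementation; the names `SGDLogistic` and `hashed_features` and all hyperparameter values are assumptions chosen for illustration.

```python
import math
import zlib


def hashed_features(text, n_buckets=2 ** 18):
    """Map lowercase character 2- to 4-grams to bucket counts (hashing trick)."""
    text = text.lower()
    counts = {}
    for n in (2, 3, 4):
        for i in range(len(text) - n + 1):
            idx = zlib.crc32(text[i:i + n].encode("utf-8")) % n_buckets
            counts[idx] = counts.get(idx, 0) + 1
    return counts


class SGDLogistic:
    """Hypothetical minimal binary logistic regression trained by SGD."""

    def __init__(self, n_buckets=2 ** 18, lr=0.1, l2=1e-4):
        self.n_buckets = n_buckets
        self.lr = lr          # learning rate (assumed value)
        self.l2 = l2          # L2 penalty strength (assumed value)
        self.w = [0.0] * n_buckets
        self.b = 0.0

    def _proba(self, x):
        z = self.b + sum(self.w[i] * v for i, v in x.items())
        z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-z))

    def predict_proba(self, text):
        """Estimated probability that the text is positive."""
        return self._proba(hashed_features(text, self.n_buckets))

    def partial_fit(self, text, y):
        """One SGD step on a single example; y is 1 (positive) or 0 (negative)."""
        x = hashed_features(text, self.n_buckets)
        err = self._proba(x) - y
        for i, v in x.items():
            # Simplification: the L2 penalty only touches active weights.
            self.w[i] -= self.lr * (err * v + self.l2 * self.w[i])
        self.b -= self.lr * err


# Toy data for illustration only; real training streams the corpus line by line.
toy = [("you rock", 1), ("awesome", 1), ("love it", 1),
       ("this sucks", 0), ("so bad", 0), ("hate it", 0)]
model = SGDLogistic()
for _ in range(30):
    for text, y in toy:
        model.partial_fit(text, y)
print(model.predict_proba("you rock"), model.predict_proba("this sucks"))
```

Because each update only needs the current example, memory use stays fixed at the size of the weight vector no matter how many lines the training files contain.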
|
|
Are you trying to do positive/negative/neutral classification or just positive/negative? I actually just completed a project for positive/negative/neutral sentiment classification of tweets with naive Bayes classifiers trained on :-) and :-(. This paper is one of the best resources I found for the positive/negative case, which I found to be a pretty straightforward problem, at least in the Twitter domain. If you take the naive Bayes with emoticons approach, I would recommend training on at least 500,000-1,000,000 examples of each class. I was also able to improve results by ignoring the prior, p(C), as emoticons don't really provide a good estimate of it.

For starters, we are doing just positive/negative. But positive/negative/neutral would actually be better. How much data did you use? Any other details would be great. Let me go through that paper. Sadly, naive Bayes did not give good results for us. I should ignore the prior like you did and see.
(May 02 '11 at 08:47)
crazyaboutliv
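For reference, "ignoring the prior" in the answer above can be sketched as follows: a multinomial naive Bayes scorer where the log p(C) term is simply dropped, so only the per-token likelihoods decide. This is a minimal standard-library sketch, not the answerer's actual code; the class name `NaiveBayes`, the `use_prior` flag and the Laplace smoothing constant `alpha` are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Hypothetical minimal multinomial naive Bayes over whitespace tokens."""

    def __init__(self, use_prior=True, alpha=1.0):
        self.use_prior = use_prior
        self.alpha = alpha                      # Laplace smoothing (assumed)
        self.token_counts = defaultdict(Counter)
        self.doc_counts = Counter()
        self.vocab = set()

    def fit(self, docs, labels):
        for doc, label in zip(docs, labels):
            tokens = doc.lower().split()
            self.token_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def log_scores(self, doc):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label in self.doc_counts:
            counts = self.token_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            # With use_prior=False the log p(C) term is dropped entirely.
            score = (math.log(self.doc_counts[label] / total_docs)
                     if self.use_prior else 0.0)
            for tok in doc.lower().split():
                if tok in self.vocab:           # skip unseen tokens
                    score += math.log((counts[tok] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, doc):
        scores = self.log_scores(doc)
        return max(scores, key=scores.get)


# Skewed toy data: four positive documents, one negative.
docs = ["good", "good", "good", "ok", "ok"]
labels = ["pos", "pos", "pos", "pos", "neg"]
with_prior = NaiveBayes(use_prior=True).fit(docs, labels)
no_prior = NaiveBayes(use_prior=False).fit(docs, labels)
print(with_prior.predict("ok"), no_prior.predict("ok"))
```

On this skewed toy set the two variants disagree on "ok": the class frequencies in an emoticon-filtered corpus reflect how the data was collected rather than how often people are actually positive or negative, which is the reason given for leaving the prior out.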
|
|
An algorithm such as NaiveBayes, by itself, will not solve your problem. Neither will using SVM, k-means, Artificial Neural Networks, or linear regression. You need to focus on the features you select, and find features that correlate with sentiment before you worry about modelling those features. While an algorithm will help you find mixtures of features that carry strong sentiment, I doubt that there would be much interplay between many features. If I had to do this, I would guess that you are looking for feature correlations by themselves, and therefore need an expert system for rule inference. Something that will return:
Thanks Robert. Yeah, you are right. We initially thought of having a few rules, mainly for removing unnecessary words and n-grams, but then we could not work out how to go about it, and the rules also multiplied rapidly. So we thought that if an out-of-the-box algorithm could help us get a decent system, we could stick with it for now before thinking further.
(May 02 '11 at 09:37)
crazyaboutliv
You can use rule building systems for that. Sorry, I'm no expert, so I can't tell you which ones to use.
(May 02 '11 at 21:47)
Robert Layton
I found a few yesterday after seeing your answer , thanks :)
(May 03 '11 at 01:19)
crazyaboutliv
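One simple way to look at features individually, in the spirit of the answer above, is to score each token by how much more often it appears in positive than in negative messages. The sketch below is only one possible scoring (a smoothed log ratio of document frequencies), not a rule inference system; the function name `log_ratio` and the smoothing constant are assumptions for illustration.

```python
import math
from collections import Counter


def log_ratio(pos_docs, neg_docs, alpha=1.0):
    """Smoothed log ratio of each token's document frequency, pos vs neg.

    Positive scores lean positive, negative scores lean negative,
    and scores near zero carry little sentiment signal.
    """
    pos = Counter(t for d in pos_docs for t in set(d.lower().split()))
    neg = Counter(t for d in neg_docs for t in set(d.lower().split()))
    n_pos, n_neg = len(pos_docs), len(neg_docs)
    scores = {}
    for tok in set(pos) | set(neg):
        p = (pos[tok] + alpha) / (n_pos + 2 * alpha)
        q = (neg[tok] + alpha) / (n_neg + 2 * alpha)
        scores[tok] = math.log(p / q)
    return scores


# Toy data for illustration only.
scores = log_ratio(["you rock", "rock on"], ["this sucks", "you suck"])
for tok, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{tok:8s} {score:+.2f}")
```

Tokens at either end of the sorted list are candidates for hand-written rules or for a reduced feature set.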
|