
Hi,

I am trying to classify YouTube videos into two pre-defined categories using user-generated content (UGC) as a proxy. My dataset contains 373 videos (tagged by three annotators) in the positive class and 373 videos in the negative class. I have extracted the UGC (description, title, comments, tags, etc.) for these videos using the Python YouTube API. I am using NLTK-Trainer to tabulate results and perform 10-fold cross-validation. I obtain very good results (about 95% accuracy) using the Maximum Entropy classifier with a bag-of-words (unigram) model. These results contradict what has been reported in the literature, and I am not sure how to proceed.
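
For concreteness, here is roughly what my pipeline looks like. This is a minimal sketch: the names (`labeled_texts`, a list of (UGC text, label) pairs built from the fetched data, `bag_of_words`, `cross_validated_accuracy`) are placeholders of mine, and the folds are not exactly even when the data does not divide by k.

    import random
    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    STOPWORDS = set(stopwords.words('english'))

    def bag_of_words(text):
        # Boolean unigram-presence features; stopwords and non-words dropped.
        return {w: True for w in word_tokenize(text.lower())
                if w.isalpha() and w not in STOPWORDS}

    def cross_validated_accuracy(labeled_texts, k=10):
        # labeled_texts: list of (ugc_text, label) pairs, label in {'pos', 'neg'}.
        data = [(bag_of_words(text), label) for text, label in labeled_texts]
        random.shuffle(data)
        fold = len(data) // k
        scores = []
        for i in range(k):
            test = data[i * fold:(i + 1) * fold]
            train = data[:i * fold] + data[(i + 1) * fold:]
            clf = nltk.classify.MaxentClassifier.train(
                train, algorithm='iis', trace=0, max_iter=10)
            scores.append(nltk.classify.accuracy(clf, test))
        return sum(scores) / len(scores)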

asked Apr 22 '11 at 17:12

Dexter

I think the best way to proceed is to gather more data and see if these results hold up; ~750 examples seems far too few to draw big conclusions. If the results do hold up, there may be some bias in the data that makes the problem too easy. If so, find it by deliberately crippling your classifier in different ways (removing features, changing the optimization, etc.) and seeing what it actually needs to reach the performance you observe. Whatever you find will either be genuinely interesting, in which case you should probably publish it, or, more likely, some peculiarity of this dataset that makes it easy, in which case you should move on.

(Apr 22 '11 at 21:23) Alexandre Passos ♦

Alexandre, Thanks for the reply.

Unfortunately I can't increase the size of the dataset due to labor constraints (annotators are needed to tag the videos), and it would take time in any case. The positive class consists of "hate India" videos, while the negative class consists of videos retrieved by searching YouTube for the term "India". I believe I have already tried to confuse my classifier by choosing the negative class this way, i.e. with a data-driven approach.

The best results were obtained with the DecisionTree and Maxent classifiers in NLTK. The feature set is simple unigrams with stopwords filtered out. My first reaction to the results was that my data is linearly separable in a high-dimensional space. How can I cripple my classifier?
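
Is something like the following what you mean? A sketch of one ablation (assuming `train` and `test` are lists of NLTK-style (feature dict, label) pairs, and using Naive Bayes as the probe because it exposes most_informative_features): drop the k strongest features, retrain, and see how much accuracy survives.

    import nltk

    def ablated_accuracy(train, test, k=50):
        # Probe with Naive Bayes to find the k most informative features.
        probe = nltk.classify.NaiveBayesClassifier.train(train)
        top = {name for name, _ in probe.most_informative_features(k)}

        def strip(feats):
            # Remove the suspect features from a feature dict.
            return {f: v for f, v in feats.items() if f not in top}

        crippled = nltk.classify.NaiveBayesClassifier.train(
            [(strip(f), y) for f, y in train])
        return nltk.classify.accuracy(crippled, [(strip(f), y) for f, y in test])

If accuracy stays near 95% after the strongest features are gone, the signal is spread across many terms; if it collapses, a handful of tell-tale words are doing all the work.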

(Apr 23 '11 at 05:07) Dexter

2 Answers:

There is almost certainly a tell in the data: a particular feature value or values that gives one class away. Why not gather some unlabeled examples, apply the model you trained, and sanity-check the classifications? I would also look at the contribution of each term to the class label or, alternatively, at the likelihood ratio given the presence of each term. There may be a term that is way off the charts.
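
For instance, something like this (a sketch; `pos_docs` and `neg_docs` are assumed to be lists of token lists for the two classes):

    import math
    from collections import Counter

    def term_log_odds(pos_docs, neg_docs, min_df=5):
        # Document frequency of each term in each class.
        pos_df = Counter(w for doc in pos_docs for w in set(doc))
        neg_df = Counter(w for doc in neg_docs for w in set(doc))
        n_pos, n_neg = len(pos_docs), len(neg_docs)
        scores = {}
        for w in set(pos_df) | set(neg_df):
            if pos_df[w] + neg_df[w] < min_df:
                continue
            # Add-one smoothed P(term present | class), then the log ratio.
            p = (pos_df[w] + 1.0) / (n_pos + 2.0)
            q = (neg_df[w] + 1.0) / (n_neg + 2.0)
            scores[w] = math.log(p / q)
        # Terms at either extreme of the sorted list are candidate tells.
        return sorted(scores.items(), key=lambda kv: kv[1])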

answered Apr 23 '11 at 05:32

downer

Downer,

This explanation is borne out by NLTK's show_most_informative_features function. There are a few words which (under Naive Bayes) are 30-40% more likely to belong to the positive class than to the negative class, and similarly for Maxent (logistic regression). Some terms have a likelihood ratio of +1.7. What do I do next?

I have divided my data into a 75-25 train/test split, which essentially means I do have held-out examples to apply the model to.
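
For reference, this is how I inspected the features and the held-out split (a sketch; `train_set` and `test_set` stand in for the (feature dict, label) pairs from my split):

    import nltk

    nb = nltk.classify.NaiveBayesClassifier.train(train_set)
    nb.show_most_informative_features(30)

    # Sanity-check the held-out 25%: which examples does the model get
    # wrong, and what features are present in them?
    for feats, gold in test_set:
        guess = nb.classify(feats)
        if guess != gold:
            print(gold, guess, sorted(feats)[:10])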

(Apr 23 '11 at 05:38) Dexter

Your next task is to work out why you got the results you did, and why they differ from the literature. Since you used a bag-of-words model, you can find the words that work best for each class, i.e. the key discriminators. You may find that only a few terms are necessary, or that many terms are needed to make the distinction; either outcome would be a result.
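
A quick way to test this (a sketch, reusing the hypothetical (feature dict, label) pairs `train_set`/`test_set` from earlier in the thread): keep only the top-k discriminators and see how small k can get before accuracy drops.

    import nltk

    def accuracy_with_top_k(train, test, k):
        probe = nltk.classify.NaiveBayesClassifier.train(train)
        keep = {name for name, _ in probe.most_informative_features(k)}

        def select(feats):
            # Keep only the k strongest features in a feature dict.
            return {f: v for f, v in feats.items() if f in keep}

        clf = nltk.classify.NaiveBayesClassifier.train(
            [(select(f), y) for f, y in train])
        return nltk.classify.accuracy(clf, [(select(f), y) for f, y in test])

    for k in (5, 10, 50, 200):
        print(k, accuracy_with_top_k(train_set, test_set, k))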

answered Apr 24 '11 at 09:15

Robert Layton

edited Apr 24 '11 at 09:17

Robert, Thanks for the reply. The problem may lie in the construction of the negative class. As I mentioned above, the positive class consists of "hate India" videos, while the negative class consists of videos retrieved by searching YouTube for the term "India". I chose the negative class this way, i.e. with a data-driven approach, precisely to try to confuse the classifier.

How else could I confuse my classifier?

(Apr 24 '11 at 09:19) Dexter