I have about 500 webpages that I am trying to classify into two groups: safe for work (SFW) and not safe for work (NSFW). The two categories are defined a priori, so I assume a supervised algorithm is appropriate. I used the typical chi-squared method to extract n-grams as features (unigrams, bigrams and trigrams only), and visual inspection suggests the selected features make sense. Each instance is a vector of feature scores, where the score is simply the count of how many times that n-gram appears in the page; these vectors are extremely sparse.

When I run Naive Bayes as a first step, I get poor results: roughly a 28% false positive rate and a 37% false negative rate. Alternatives such as C4.5, linear SVM and logistic regression all return worse classification errors. I could use some advice on what to look at next:
What would you suggest? What other features would you suggest besides word counts? Should I use regularization?
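For reference, my setup is roughly equivalent to the sketch below (scikit-learn; load_pages stands in for my own data-loading code):

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import Pipeline

    # pages: list of raw page texts; labels: 1 = NSFW, 0 = SFW
    pages, labels = load_pages()  # placeholder for my own loading code

    pipeline = Pipeline([
        # unigram, bigram and trigram counts
        ("counts", CountVectorizer(ngram_range=(1, 3))),
        # keep the top-k n-grams by chi-squared score against the labels
        ("chi2", SelectKBest(chi2, k=5000)),
        ("nb", MultinomialNB()),
    ])

    # with only ~500 pages, cross-validated error is less noisy
    # than a single train/test split
    scores = cross_val_score(pipeline, pages, labels, cv=5)
    print("mean accuracy: %.3f" % np.mean(scores))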
Get more data, and stick with generative models. Try binary term weighting instead of counts. Since "safe for work" should cover the large majority of pages, you can take new pages that your algorithm classifies as NSFW, review them, and add the ones that are actually SFW as training examples. The algorithm will often be wrong on exactly these pages, which makes them excellent training cases: they are by definition hard for the current model to classify, so a little annotation effort can greatly enhance your data. Unless you get a lot more data, you should probably stick to simple word presence/absence features.
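As a sketch of what I mean by binary term weighting (scikit-learn; train_pages, test_pages and train_labels are placeholders):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import BernoulliNB

    # binary=True records presence/absence of each n-gram, not counts
    vectorizer = CountVectorizer(ngram_range=(1, 3), binary=True)
    X_train = vectorizer.fit_transform(train_pages)
    X_test = vectorizer.transform(test_pages)

    # BernoulliNB is the naive Bayes variant designed for binary features
    clf = BernoulliNB().fit(X_train, train_labels)
    predictions = clf.predict(X_test)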
Question updated. The categories are "Safe for Work" and "Not Safe for Work".
(Mar 12 '12 at 23:31)
Ryan Rosario
Why do you say "stick with generative models"? For a classification task, that is not exactly general wisdom.
(Mar 13 '12 at 18:02)
Travis Wolfe
@Travis: I am wondering the same thing. I think he might be referring to text classification, not classification in general.
(Mar 13 '12 at 21:19)
Ryan Rosario
The reason to stick with a generative model is that he has such a small number of documents.
(Mar 14 '12 at 00:50)
gdahl ♦
I think you should try to add more samples rather than features :) Also, have you tried to evaluate the intrinsic quality of the classification labels by relabeling ~100 samples manually yourself, reading the documents, and comparing your own predictive performance to that of the algorithm?
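If you keep your relabels alongside the original labels and the classifier's predictions, the comparison is short (a scikit-learn sketch; the three arrays are placeholders):

    from sklearn.metrics import cohen_kappa_score, confusion_matrix

    # my_labels: your manual relabels of ~100 pages
    # original_labels: the labels those pages came with
    # model_labels: the classifier's predictions on the same pages
    print(cohen_kappa_score(my_labels, original_labels))  # label quality
    print(confusion_matrix(my_labels, model_labels))      # where the model fails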
If I understand correctly, you mean I should manually classify some of the pages myself and compare against the algorithm (in other words, measure how difficult this problem is to classify)?
(Mar 13 '12 at 12:46)
Ryan Rosario
Yes. It might also give you an idea of how the classifier is failing.
(Mar 13 '12 at 14:15)
ogrisel
Are you stemming the words? That should help at least a little by cleaning up the features (see the sketch below). Do you have a rough estimate of how many features appear in each document? Also, what are the error rates when you simply use all unigrams?
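For example, stemming can be folded into the vectorizer with a custom tokenizer (a sketch assuming NLTK and scikit-learn):

    from nltk.stem.porter import PorterStemmer
    from sklearn.feature_extraction.text import CountVectorizer

    stemmer = PorterStemmer()
    # reuse the vectorizer's default tokenizer, then stem each token so
    # variants like "image", "images" and "imaging" share one feature
    base_tokenizer = CountVectorizer().build_tokenizer()

    def stemming_tokenizer(text):
        return [stemmer.stem(token) for token in base_tokenizer(text)]

    vectorizer = CountVectorizer(tokenizer=stemming_tokenizer, ngram_range=(1, 3))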
Have you tried tuning the penalization parameter for the SVM or logistic regression? With an appropriately tuned L1 penalty, they should both easily beat Naive Bayes.
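For instance, a cross-validated sweep over the regularization strength (a scikit-learn sketch; X_train and train_labels are placeholders for your vectorized data):

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    # C is the inverse regularization strength; sweep it on a log scale.
    # The liblinear solver supports the L1 penalty and sparse input.
    grid = GridSearchCV(
        LogisticRegression(penalty="l1", solver="liblinear"),
        param_grid={"C": [0.01, 0.1, 1.0, 10.0, 100.0]},
        cv=5,
    )
    grid.fit(X_train, train_labels)
    print(grid.best_params_, grid.best_score_)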