|
I am trying to build a news classifier. I am working with 50 topics at the moment and might add more in the future. For each topic I have at least 1000 examples. In the news stream I want to classify, some categories are much more frequent than others (for example, sport >> classical music). Also, some categories are well separated and others are quite similar, like film and theatre. What do you think would be the best tools and/or methods to use? |
|
Words are probably the best features. Take the middle range of words - rank all words from highest to lowest frequency and remove the top and bottom 10%. The remaining words should be a good baseline feature set. To improve on this, try synonyms. To improve on that, do a Google Scholar search for the "twenty newsgroups" and "reuters" corpora, and have a look at those papers. Those two corpora are widely used for this type of task.

I should also point out - don't worry too much about the exact classification algorithm yet. Just fix it as an SVM, or something similar, and only once preprocessing and feature extraction are giving you good results, try changing the classification algorithm.

I built a list of stopwords by computing the entropy of the vector of probabilities of each word in each class. Basically, the higher the entropy, the more evenly the word is used across classes (like the word "the"). This gave me a 1% boost in accuracy. I also dropped the low-frequency words. I focused my effort on making a good training dataset, but it is somewhat biased in some topics. I tried various off-the-shelf classifiers and all of them gave me similar accuracy scores. I was hoping there were some tools that handle a large number of classes well. I also have a huge unlabeled dataset (11 million article snippets) and want to use it to boost the accuracy, if possible.
(Jul 27 '11 at 15:09)
Visarga
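For anyone who wants to try the entropy-based stopword idea from the comment above, here is a minimal sketch using only the standard library. The function names and the toy documents are made up for illustration, and normalising each word's counts by class size before computing the entropy is my assumption, not necessarily what was done in the thread.

```python
import math
from collections import Counter, defaultdict

def class_entropy_per_word(docs):
    """docs: iterable of (label, tokens). Returns {word: entropy in bits}."""
    word_class_counts = defaultdict(Counter)  # word -> Counter(label -> count)
    class_totals = Counter()                  # label -> total tokens in that class
    for label, tokens in docs:
        for tok in tokens:
            word_class_counts[tok][label] += 1
            class_totals[label] += 1

    entropies = {}
    for word, per_class in word_class_counts.items():
        # Assumption: use the word's relative frequency inside each class,
        # so that large classes do not dominate the distribution.
        rates = [per_class[c] / class_totals[c] for c in per_class]
        total = sum(rates)
        probs = [r / total for r in rates]
        entropies[word] = -sum(p * math.log2(p) for p in probs)
    return entropies

def stopword_candidates(entropies, top_n):
    """Words spread most evenly over the classes (highest entropy first)."""
    ranked = sorted(entropies, key=entropies.get, reverse=True)
    return ranked[:top_n]

# Toy data, purely illustrative.
toy_docs = [
    ("sport", ["the", "match", "the", "goal"]),
    ("music", ["the", "concert", "the", "violin"]),
]
toy_entropies = class_entropy_per_word(toy_docs)
toy_stopwords = stopword_candidates(toy_entropies, top_n=1)
```

A word that appears in only one class gets entropy 0, and a word spread evenly over K classes gets log2(K), so the cut-off has to be chosen relative to the number of classes.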
|
|
Are the categories mutually exclusive, or can a document belong to more than one category? If a document can belong to several, I'd recommend building K binary classifiers (K == number of categories), where each classifier is trained to predict one class versus all others. If your classes are mutually exclusive, use a multi-class classifier such as Multinomial Logistic Regression (aka MaxEnt) or Naive Bayes. I prefer the former since it usually provides better calibrated probabilities (Naive Bayes tends to produce overconfident probability estimates). You can find further information in the slides of the Text Mining Tutorial by Tong Zhang (he discusses hierarchical text classification too).

Some categories, like weather and politics, are mutually exclusive, but others, like police and law, overlap. I will try to build N binary classifiers and see how that works out. At the moment I am using a Naive Bayes classifier with 88% training accuracy, and it doesn't do the job in a satisfactory way.
(Jul 27 '11 at 15:04)
Visarga
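A rough sketch of the "K binary classifiers" setup for overlapping categories, in case it helps. It assumes scikit-learn, which nobody in this thread mentioned, and the texts and labels are invented toy data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy corpus (illustrative only); a document may carry several labels.
texts = [
    "police arrest suspect after robbery",
    "court rules on appeal in fraud case",
    "police testify in court trial",
    "heavy rain and storms expected tomorrow",
]
labels = [{"police"}, {"law"}, {"police", "law"}, {"weather"}]

# One 0/1 column per category.
binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(labels)

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# One independent binary logistic regression per category (one-vs-rest).
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, Y)

# Each column is that category's own probability; rows need not sum to 1.
probs = clf.predict_proba(vectorizer.transform(["police called to court"]))
```

Because every category gets its own yes/no decision, a snippet can end up in both police and law, or in no category at all, depending on the threshold you pick per classifier.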
|
|
I suggest taking a look at Ruslan Salakhutdinov's work on the "replicated softmax" RBM. You can train one on the word count vectors (maybe take the log of the counts) of various documents quite easily and extract features. Then just stick a logistic regression layer on top of it and, if you want, backpropagate all the way through the RBM weights. That being said, Robert Layton makes a good point that you should spend some time preprocessing. But I disagree that you should spend a lot of time engineering features at first, unless you are confident you can do it especially effectively. So maybe start by just training a logistic regression classifier on a word count vector, preprocessed by removing some words (like non-content words) and possibly transforming the counts (take the log, try tf-idf, normalize and use cosine distance, anything simple you can think of). |
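The simple starting point described at the end of this answer (not the RBM part) could look roughly like the sketch below. It assumes scikit-learn, which the answer does not name; the stopword list, the `sublinear_tf` setting and the toy data are my choices for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training set (illustrative only).
train_texts = [
    "the striker scored a late goal in the match",
    "the team won the league title",
    "the orchestra performed a violin concerto",
    "the pianist played a sonata at the concert",
]
train_labels = ["sport", "sport", "classical music", "classical music"]

baseline = make_pipeline(
    # Drops common English non-content words, uses 1 + log(tf) instead of raw
    # counts, applies idf weighting and L2-normalises each document vector.
    TfidfVectorizer(stop_words="english", sublinear_tf=True),
    LogisticRegression(max_iter=1000),
)
baseline.fit(train_texts, train_labels)

pred = baseline.predict(["a goal in the final match"])
```

Swapping the transformation (raw counts, log counts, tf-idf) then only means changing the first pipeline step, which makes it cheap to compare the simple options before trying anything heavier.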
|
You've got lots of training data, so simple is fine here. Just use words as features, with log tf values as feature weights. Log tf weighting means the feature value is 0 if the word j does not occur in the document, and 1 + log(TF_j) if it does, where TF_j is the number of occurrences of the word j in the document. If document lengths vary by more than a factor of 5 or so, maybe try a simple length normalization, such as cosine or pivoted. Use a discriminative learner with good regularization, such as Bayesian logistic regression or SVMs. Some possibilities are SVM Light and (self-promotion warning) BXR. But before you do any of that, measure how well one human's category assignments agree with another's when each labels the same documents blind. That's pretty much an upper bound on machine effectiveness, and will tell you if you need to revise your category system. |
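To make the log tf weighting and cosine normalization above concrete, here is a small standard-library sketch. The function names are hypothetical, and the natural log is an assumption since the answer does not state a base.

```python
import math
from collections import Counter

def log_tf_vector(tokens):
    """Sparse log tf vector: 1 + log(TF_j) for words that occur.

    Words that do not occur are simply absent from the dict, i.e. weight 0.
    """
    counts = Counter(tokens)
    return {word: 1.0 + math.log(tf) for word, tf in counts.items()}

def cosine_normalize(vec):
    """Scale the vector to unit Euclidean length so document length cancels out."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    if norm == 0:
        return dict(vec)
    return {word: v / norm for word, v in vec.items()}

# Toy document (illustrative only).
doc_tokens = ["goal", "goal", "goal", "match"]
weights = log_tf_vector(doc_tokens)
unit_weights = cosine_normalize(weights)
```

A word seen once gets weight 1 and a word seen three times gets 1 + log(3), roughly 2.1, so repeated words count for more but far less than linearly.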