Hello all, I have 3 classes of texts for my classification task. The first class is SPAM, the second is legitimate ham texts, and the third is legitimate texts that contain a SPAM message inside them. As a starting point I tested naive Bayes on my dataset, but I got really poor results: most of the texts in the third class were classified as SPAM. I'm planning to test boosted decision trees and SVMs with Gaussian kernels, feeding them with active learning. Which techniques, classifiers, or papers would you recommend? (Also, I'm pretty new to text classification.)
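For context, a baseline of the kind described above might be sketched like this with scikit-learn. The texts, labels, and query below are all invented for illustration; they are not from any real dataset:

```python
# Sketch of a 3-class naive Bayes text baseline (toy, invented data).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical examples for the three classes: spam, ham, and
# "mixed" (legitimate text with a spam message embedded in it).
train_texts = [
    "win free money now claim your prize",           # spam
    "free prize click here to win money",            # spam
    "meeting moved to tuesday see agenda attached",  # ham
    "please review the attached quarterly report",   # ham
    "see agenda attached also win free money now",   # mixed
    "quarterly report attached claim your prize",    # mixed
]
train_labels = ["spam", "spam", "ham", "ham", "mixed", "mixed"]

# Bag-of-words counts fed into multinomial naive Bayes.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["win free money now"]))
```

Because naive Bayes only sees global word counts, the "mixed" class shares most of its vocabulary with the other two, which is consistent with the confusion described above.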
I'd try using a topic model to tag the spammy words. You can have two topics (ham and spam) or three topics (ham, spam, and background), plus a ternary decision variable that decides whether a document contains only ham and background, only spam and background, or all three topics. You can fit this with Gibbs sampling, using an algorithm that is a mixture of the naive Bayes sampler and the LDA sampler. In general, I think you want a structured approach that explicitly tags the spammy and hammy sections of your text. So you might have good success with a per-word CRF and a threshold on how many spammy words are necessary before you label a document as mixed or as outright spam.

Thanks very much, Alexandre! CRF was the algorithm I was looking for :D
(Sep 11 '10 at 17:32)
cglr
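As a rough sketch of the thresholding idea from the answer above: once something has tagged individual words as spammy (a per-word CRF in the answer's proposal; here just a toy lexicon standing in for one), labeling the document reduces to counting. The lexicon, function name, and threshold value below are all invented for illustration:

```python
# Sketch: label a document from per-word spam tags plus a threshold.
# A real system would get the per-word tags from a CRF; this toy
# lexicon stands in for the tagger.
SPAM_WORDS = {"win", "free", "money", "prize", "claim"}  # hypothetical

def label_document(tokens, spam_words=SPAM_WORDS, spam_threshold=0.5):
    """Return 'ham', 'mixed', or 'spam' from the fraction of spammy tokens."""
    if not tokens:
        return "ham"
    spam_fraction = sum(t in spam_words for t in tokens) / len(tokens)
    if spam_fraction == 0:
        return "ham"    # no spammy words at all
    if spam_fraction >= spam_threshold:
        return "spam"   # mostly spammy words
    return "mixed"      # legitimate text with spam embedded in it

# 3 of 7 tokens are spammy: below the threshold but nonzero -> "mixed".
print(label_document("meeting agenda attached also win free money".split()))
```

The threshold is the knob the answer mentions: raising it makes more documents land in the "mixed" bucket instead of "spam".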