I have a highly imbalanced training/test set (train: 149/97000, test: 245/218000). I trained an RBF SVM classifier after grid-searching for the best C and gamma parameters. The accuracy is around 99%, but precision and recall are only slightly above 50%. Before starting the next part of my work, I would like to ask for your ideas on which of the most common classifiers will do better on such a skewed corpus.

SVM vs KNN vs Logistic Regression vs Neural Networks

P.S. I am learning these techniques day by day, so please be gentle! :)
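
Edit: for concreteness, here is a rough, self-contained sketch of the kind of setup I mean, written with scikit-learn's SVC (which wraps libsvm). The synthetic data and the parameter grid are placeholders, not my actual corpus or grid.

```python
# Minimal sketch: grid-search an RBF SVM on a heavily skewed synthetic set,
# then compare accuracy with precision/recall.  All values are illustrative.
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for a skewed corpus (~0.2% positives).
X, y = make_classification(n_samples=20000, n_features=20,
                           weights=[0.998, 0.002], flip_y=0.0,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

param_grid = {"C": [1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1]}
grid = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)  # default scoring = accuracy
grid.fit(X_train, y_train)

y_pred = grid.best_estimator_.predict(X_test)
# With a few hundred positives against hundreds of thousands of negatives,
# accuracy is dominated by the negative class, so it can look excellent
# while precision/recall on the positives stay mediocre.
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred, zero_division=0))
print("recall   :", recall_score(y_test, y_pred, zero_division=0))
```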

asked Aug 26 '13 at 14:14


Alvin Pastore

First off, are you using a software package (e.g. scikit-learn — which version, 0.14?)? If you have only ~100 positive points then you need the simplest model possible: logistic regression or a linear SVM (and you can add various regularisation terms). To speed up the analysis, it is better to downsample the negatives, i.e. take a random sample of the negative class AND reweight the negative class in training by the corresponding fraction.
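
Roughly what I mean by downsampling plus reweighting, as a sketch (the keep_frac value is arbitrary, and downsample_negatives is just an illustrative helper, not part of any library):

```python
import numpy as np

def downsample_negatives(X, y, keep_frac=0.1, rng=None):
    """Keep all positives and a random keep_frac of the negatives;
    give each kept negative weight 1/keep_frac so the effective class
    balance seen by the training loss is unchanged."""
    rng = np.random.default_rng(rng)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    kept_neg = rng.choice(neg, size=int(len(neg) * keep_frac), replace=False)
    idx = np.concatenate([pos, kept_neg])
    sample_weight = np.ones(len(idx))
    sample_weight[len(pos):] = 1.0 / keep_frac  # reweight the kept negatives
    return X[idx], y[idx], sample_weight

# Usage with any estimator that accepts per-sample weights, e.g.:
#   Xs, ys, w = downsample_negatives(X_train, y_train, keep_frac=0.1, rng=0)
#   LogisticRegression(max_iter=1000).fit(Xs, ys, sample_weight=w)
```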

(Aug 27 '13 at 08:40) SeanV

I am currently using libsvm, and I also tried liblinear, but there's no big change in the results. What do you mean by "add various regularisation terms"?

I tried playing with the class weights to penalize misclassification of the positives more heavily, but that only helps up to a point; beyond that the loss in precision is too high.
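
This is roughly what I mean by playing with the weights, sketched with scikit-learn's class_weight argument (the equivalent of libsvm's -wN options). The weight values are arbitrary, and X_train/X_test are the arrays from the sketch in my question:

```python
# Sweep the positive-class weight and watch the precision/recall trade-off.
from sklearn.metrics import precision_score, recall_score
from sklearn.svm import SVC

for w in [1, 5, 25, 100, 400]:
    clf = SVC(kernel="rbf", C=10, gamma=0.01, class_weight={1: w})
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # Raising the positive-class weight usually buys recall at the cost of
    # precision; past some point the precision loss dominates.
    print(w,
          precision_score(y_test, y_pred, zero_division=0),
          recall_score(y_test, y_pred, zero_division=0))
```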

(Aug 27 '13 at 13:41) Alvin Pastore

One Answer:

I don't know libsvm etc. directly; I use them through scikit-learn. But I think you ought to try logistic regression [which I believe you can get from liblinear]. The key point is that it directly optimises the probability estimate, rather than just trying to minimise classification error. You might also want to use the area under the ROC curve as the scoring metric in your grid search [since, unlike plain classification metrics, it does not depend on an arbitrary classification threshold].
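
Roughly what I have in mind, as an untested sketch with scikit-learn (the C grid and the class_weight="balanced" setting are just illustrative choices; X_train/X_test are the arrays from the sketch in the question):

```python
# Logistic regression with the grid search scored by ROC AUC instead of accuracy.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {"C": [0.01, 0.1, 1, 10, 100]}
logreg = LogisticRegression(class_weight="balanced", max_iter=1000)
grid = GridSearchCV(logreg, param_grid, scoring="roc_auc", cv=5)
grid.fit(X_train, y_train)

# predict_proba gives the probability estimate the model is actually
# optimising, so the classification threshold can be chosen afterwards.
p_test = grid.best_estimator_.predict_proba(X_test)[:, 1]
```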

For logistic regression, for speed reasons (a faster grid search, etc.) you might consider subsampling the large class and reweighting it, but this can be a pain to get working with cross-validation [arguably you want to subsample the training data but not the validation data!].
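
A sketch of that cross-validation point: subsample and reweight the negatives inside each training fold only, leaving the validation fold untouched. This reuses the downsample_negatives helper from my comment above, and keep_frac is arbitrary:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

aucs = []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in cv.split(X_train, y_train):
    # Downsample/reweight only the training fold.
    Xs, ys, w = downsample_negatives(X_train[train_idx], y_train[train_idx],
                                     keep_frac=0.1, rng=0)
    clf = LogisticRegression(max_iter=1000).fit(Xs, ys, sample_weight=w)
    # Score on the untouched validation fold, which keeps the real class balance.
    p_val = clf.predict_proba(X_train[val_idx])[:, 1]
    aucs.append(roc_auc_score(y_train[val_idx], p_val))
print("mean CV AUC:", np.mean(aucs))
```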

In libsvm etc. you need to get out the decision function (i.e. the continuous value from -infinity to +infinity, rather than the predicted class [-1 or +1]) so that you can choose/vary the threshold at which points are assigned to the A/B classes (and again I suggest using the area under the ROC curve in your grid search). There is at least one version of SVM [e.g. SVMperf, http://svmlight.joachims.org/ ] which "lets you directly optimize multivariate performance measures like F1-Score, ROC-Area, and the Precision/Recall Break-Even Point". I have yet to use it.
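
A sketch of the thresholding idea with scikit-learn's libsvm wrapper (the grid is illustrative, X_train/X_test are the arrays from the sketch in the question, and in practice you would pick the threshold on a validation set rather than the final test set):

```python
# Take continuous decision values from the SVM, score the grid search by
# ROC AUC, then inspect the precision/recall trade-off over thresholds.
from sklearn.metrics import precision_recall_curve, roc_auc_score
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {"C": [1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1]}
grid = GridSearchCV(SVC(kernel="rbf"), param_grid, scoring="roc_auc", cv=3)
grid.fit(X_train, y_train)

scores = grid.best_estimator_.decision_function(X_test)  # continuous, not -1/+1
print("held-out AUC:", roc_auc_score(y_test, scores))

precision, recall, thresholds = precision_recall_curve(y_test, scores)
# Each threshold along this curve is a possible operating point; choose the
# precision/recall trade-off that suits the application instead of the
# default cut at decision_function == 0.
```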

answered Sep 01 '13 at 15:46


SeanV
