|
I have a binary classification problem that turns out to be quite challenging. I tried several classifiers in weka, just to compare, but hardly any could beat logistic regression. My expectation was that an SVM should get better results, but after properly cross-validating it was at best ~1% better in overall accuracy, if at all. Is there anything else I could do, or is it not that uncommon for logistic regression to be comparable with an SVM? In addition, I would like to maximize accuracy at a limited recall (e.g. 20%); there the SVM could be ~5% better. What other algorithms perform well at small recall ranges? By the way, I was using scikits.learn for my experiments, and I'm very happy with its performance.

Edit: The data set consists of 4000/4000 positives/negatives, and I have 50 features. The features are somewhat sparse and mainly concentrated around zero. Can I conclude that one can only get so much out of the data if logistic regression and the SVM don't differ by much? All training samples end up as support vectors. Is that a symptom that the data is really bad? Here are the results for the best C and gamma, if that tells you anything:

                 precision    recall  f1-score   support
    avg / total       0.63      0.63      0.63      4000 |
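For reference, a minimal sketch of the kind of comparison described in the question, using scikit-learn; the data below is a random placeholder for the real 8000 x 50 set, so only the mechanics carry over:

    # Hypothetical sketch: compare logistic regression and an RBF SVM by
    # cross-validated accuracy. X and y are random stand-ins for the real data.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    rng = np.random.RandomState(0)
    X, y = rng.randn(2000, 50), rng.randint(0, 2, 2000)   # placeholder data

    for name, clf in [("logreg", LogisticRegression(max_iter=1000)),
                      ("rbf-svm", SVC(kernel="rbf", C=1.0, gamma="scale"))]:
        scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
        print("%-8s accuracy: %.3f +/- %.3f" % (name, scores.mean(), scores.std()))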
|
Logistic regression and SVM solve only slightly different variations of essentially the same problem. Kernels have issues with capacity control: all learning is done in terms of dot products between items, so you can never directly learn things like "this feature doesn't matter" (it takes many, many support vectors for that to happen, even with the quadratic kernel). I know this is frowned upon in some circles, but have you tried running a neural network to see what happens? Theano has good, fast implementations which are really easy to set up and test. I tried the multilayer perceptron in weka, but that didn't give me much. I'll try that, thanks for the suggestion. At least I could do much better feature selection by looking at the trained weights of the NN.
(Sep 12 '11 at 13:54)
Oliver Mitevski
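One quick way to try the neural-network suggestion without setting up Theano is scikit-learn's MLPClassifier; this is a substitute sketch, not the implementation the answer refers to, and the data is a placeholder:

    # Sketch of the neural-network suggestion using scikit-learn's MLPClassifier
    # as a stand-in for the Theano implementation mentioned above.
    import numpy as np
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import cross_val_score

    rng = np.random.RandomState(0)
    X, y = rng.randn(2000, 50), rng.randint(0, 2, 2000)   # placeholder data

    mlp = make_pipeline(
        StandardScaler(),   # neural nets are sensitive to feature scale
        MLPClassifier(hidden_layer_sizes=(50,), max_iter=500, random_state=0),
    )
    print(cross_val_score(mlp, X, y, cv=5).mean())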
1
Alex, do you have a reference backing the statement 'you can never directly learn things like "this feature doesn't matter" (it takes many, many support vectors for this to happen even using the quadratic kernel)'? I fully agree with you, but I'd love to see some kind of formal proof, or at least a well-conducted experiment.
(Sep 20 '11 at 21:53)
Gael Varoquaux
@Gael: I unfortunately don't know any references, but if you imagine projecting everything into the expanded feature space and at the same time representing the weight vector as a linear combination of training examples, you'll see that (1) each feature in the original space will be part of many features in the expanded space, and (2) by linear algebra it will take a lot of training examples to drive all those weights down to zero while keeping the others pegged to their sensible values. Another way of looking at this is that the useless features can make up a big share of the dot product of a test sample with any given training sample, so you'll need to average over a lot of training samples to cancel this effect out.
(Sep 20 '11 at 21:58)
Alexandre Passos ♦
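A rough version of the kind of experiment Gael asks for (a hypothetical setup, not something from the thread): make only one feature informative, pad with irrelevant noise features, and watch how many support vectors a degree-2 polynomial SVM keeps:

    # Hypothetical experiment: only feature 0 carries signal; the rest is noise.
    # Count support vectors of a degree-2 polynomial SVM as noise features are added.
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.RandomState(0)
    n = 2000
    signal = rng.randn(n)
    y = (signal > 0).astype(int)

    for n_noise in [0, 10, 50, 200]:
        X = np.column_stack([signal] + [rng.randn(n) for _ in range(n_noise)])
        svm = SVC(kernel="poly", degree=2, coef0=1.0, C=1.0).fit(X, y)
        print("noise features: %3d  support vectors: %d"
              % (n_noise, svm.n_support_.sum()))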
On the other hand, maybe Andrew Ng's paper on rotation invariance can be helpful here: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.92.9860&rep=rep1&type=pdf
(Sep 20 '11 at 21:58)
Alexandre Passos ♦
|
|
All your data points being support vectors is a symptom of some kind of problem. One possible reason is that your features are sparsish, mostly near 0, which suggests you have long-tailed, or at least high-kurtosis, feature distributions. That can be a problem for the RBF kernel, or for any method that uses a Euclidean metric on the feature space.

If you want to continue with SVMs, I would recommend using nu-SVM rather than the usual C-SVM mode (it is available in libsvm and in the R svm packages) and setting nu=0.5. This will force only half your data to be support vectors and implicitly adapt C to make that happen. With the linear kernel there is then nothing left to tune (nu=0.5 is usually a reasonable setting, and even if you do tune it, at least you know what range to tune in), and it should be no worse than logistic regression. If you continue with the RBF kernel, you only need to optimize gamma; with your data you will want to search a wider range of gamma than usual. I would be more inclined to go with polynomial kernels, though.

However, ensembles of classification trees are probably more appropriate for your data. In particular, they are invariant w.r.t. monotonic transformations of their input features, so they do not assume any geometry on the data and have no problems with long-tailed distributions. If you are concerned about overfitting (you should be), use bagging rather than boosting. Random forests would be the first thing to try; the R randomForest package should handle your data easily.

The other thing I would recommend is doing some exploratory analysis and visualization to get a better feel for what might be possible. GGobi is a nice tool for that. Also, if you use the R randomForest package, you can train with proximity=TRUE and then apply the MDSplot function to your model; this can give you some insights into the data. |
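A rough scikit-learn translation of the nu-SVM and random-forest suggestions above (NuSVC and RandomForestClassifier stand in for the libsvm/R tools named in the answer; X, y, and the tree count are placeholders):

    # nu-SVM: nu lower-bounds the fraction of support vectors (and upper-bounds
    # the fraction of margin errors), so nu=0.5 keeps roughly half the data as SVs.
    # The random forest is the tree-ensemble suggestion; 500 trees is arbitrary.
    import numpy as np
    from sklearn.svm import NuSVC
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.RandomState(0)
    X, y = rng.randn(2000, 50), rng.randint(0, 2, 2000)   # placeholder data

    for name, clf in [("nu-SVM (linear)", NuSVC(nu=0.5, kernel="linear")),
                      ("random forest", RandomForestClassifier(n_estimators=500,
                                                               random_state=0))]:
        print(name, cross_val_score(clf, X, y, cv=5).mean())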
|
As far as I know, logistic regression is usually comparable with linear SVMs. If you use an RBF kernel with properly scaled data and cross-validated C and gamma, I'd expect it to be better. Yes, that's what I would expect too. I used RBF kernels, with cross-validation over C and gamma, and the data is properly normalized.
(Sep 12 '11 at 09:17)
Oliver Mitevski
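For completeness, a minimal sketch of the scaling plus cross-validated grid search over C and gamma that both comments describe (the data and the grid values are arbitrary placeholders):

    # Standardise the features, then grid-search C and gamma for an RBF SVM.
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import GridSearchCV

    rng = np.random.RandomState(0)
    X, y = rng.randn(2000, 50), rng.randint(0, 2, 2000)   # placeholder data

    pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC(kernel="rbf"))])
    grid = {"svm__C": [0.1, 1, 10, 100], "svm__gamma": [1e-3, 1e-2, 1e-1, 1]}
    search = GridSearchCV(pipe, grid, cv=5, scoring="accuracy").fit(X, y)
    print(search.best_params_, search.best_score_)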
maybe your data is linearly separable ;) or your training set is too small...
(Sep 12 '11 at 09:21)
Andreas Mueller
Oh I wish :). That would be too easy. Anyway I added some description about the dataset. But maybe that's the maximum I can get from this data set.
(Sep 12 '11 at 09:32)
Oliver Mitevski
SVM and logreg just use different loss functions. You can use either with kernels (see "Kernel logistic regression and the import vector machine", Zhu and Hastie, NIPS 2002), but no one ever seems to use the kernelized version.
(Sep 12 '11 at 17:50)
Vicente Malave
The main reason almost no one uses logistic regression with kernels is that, since logistic regression is dense in the dual, every single training point would technically be a support vector, which would make the classifier prohibitively costly to apply in most settings (to classify every test sample, a full pass through the training set would be necessary). Naively, this is an order of magnitude slower than SVMs, where usually only a fixed fraction (depending on C) of the training set is used as support vectors.
(Sep 12 '11 at 20:38)
Alexandre Passos ♦
@Vicente: Because nobody uses kernelized logistic regression, I was pretty sure Oliver meant linear logistic regression ;) @Alexandre: Do you have a reference for that?
(Sep 13 '11 at 04:28)
Andreas Mueller
2
Not a real reference, but note that the optimization problem of logistic regression depends only on dot products between w and the training points. If you use the kernel trick and keep one alpha variable for each training point, letting w dot x = sum_i alpha_i K(t_i, x), the gradient of the objective in terms of alpha is easy to compute, and hence you can solve the optimization problem quite easily. However, unlike the SVM (this is a property of the hinge loss, more specifically the fact that it minimizes something like an l1 relaxation of the 0-1 loss, and l1 relaxations lead to sparsity), there is no reason here for this gradient to be sparse, so in practice you will find that the solution has nonzero alphas for all training points, which is why it should be slow.
(Sep 13 '11 at 06:53)
Alexandre Passos ♦
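A toy numerical illustration of this point (my own sketch, not a reference): fit kernel logistic regression by gradient descent on the alpha coefficients and compare how many end up at zero with the number of support vectors an SVM keeps on the same data.

    # Toy kernel logistic regression: f(x) = sum_i alpha_i K(x_i, x).
    # Gradient descent on alpha; essentially no alpha ends up at zero, unlike
    # the SVM's sparse set of support vectors on the same synthetic data.
    import numpy as np
    from sklearn.metrics.pairwise import rbf_kernel
    from sklearn.svm import SVC

    rng = np.random.RandomState(0)
    X = rng.randn(300, 5)
    y = np.sign(X[:, 0] + 0.5 * rng.randn(300))       # labels in {-1, +1}

    K = rbf_kernel(X, gamma=0.5)
    alpha, lam, lr = np.zeros(len(X)), 1e-2, 0.1
    for _ in range(500):
        margins = y * (K @ alpha)
        grad = -K @ (y / (1.0 + np.exp(margins))) / len(X) + lam * (K @ alpha)
        alpha -= lr * grad

    print("near-zero alphas (kernel logreg):", np.sum(np.abs(alpha) < 1e-6))
    print("support vectors (SVM):          ",
          SVC(kernel="rbf", gamma=0.5).fit(X, y).n_support_.sum())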
|
|
SVMs and logistic regression often have quite similar effectiveness. If you follow the suggestions above and put a lot of work into tuning the SVMs, you can probably push their effectiveness up. But the same would be true if you put a lot of work into tuning logistic regression (e.g. by using appropriate priors, or even kernels, as someone mentioned). You mentioned that you're interested in maximizing accuracy at roughly 20% recall. If the training and test sets have similar distributions (e.g. both are random samples from the same population), then this can be done by choosing the threshold that gives 20% recall on the training data. However, if the distribution of inputs is different on the test set, then a fixed threshold does not accomplish this. In that case, logistic regression has the advantage that you can use the posterior probabilities to find the test-set threshold that gives approximately the recall you desire. |
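A small sketch of that thresholding step with scikit-learn's precision_recall_curve (the labels and scores below are random placeholders; in practice the scores would come from predict_proba on held-out data):

    # Pick the decision threshold that gives ~20% recall on held-out data,
    # then report the precision at that operating point.
    import numpy as np
    from sklearn.metrics import precision_recall_curve

    rng = np.random.RandomState(0)
    y_val = rng.randint(0, 2, size=2000)                        # placeholder labels
    probs = 0.3 * y_val + 0.7 * rng.rand(2000)                  # placeholder scores
    # In practice: probs = clf.predict_proba(X_val)[:, 1]

    precision, recall, thresholds = precision_recall_curve(y_val, probs)
    idx = np.argmin(np.abs(recall[:-1] - 0.20))                 # closest to 20% recall
    print("threshold %.3f gives precision %.3f at recall %.3f"
          % (thresholds[idx], precision[idx], recall[idx]))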
|
With so few features and a comparatively good number of samples, you could try to expand your features with the pairwise feature products (while also keeping the original 50 features) to get 50 + 50 ** 2 = 2550 features, and then use linear logistic regression on this (don't forget to cross-validate to find the value of the regularization parameter, since you will now have a much higher risk of overfitting). Taking feature products might help logistic regression find a non-linear relationship that separates the data better. I have no intuition on whether that would give better results than an SVM with an RBF kernel; it should be more like an SVM with a polynomial kernel (but probably faster to train). It's very much like an SVM with polynomial degree 2, so I wouldn't expect any improvement here. I actually tested an SVM with a polynomial kernel, but it didn't give good results. Probably I would need to preprocess the data better before applying any algorithms.
(Sep 13 '11 at 05:54)
Oliver Mitevski
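For reference, a sketch of the intra-sample feature-product expansion ogrisel suggests, using scikit-learn's PolynomialFeatures; note that it adds squares and unordered pairwise products (1325 columns from 50 features) rather than the full 50 ** 2 ordered products, and X, y are placeholders:

    # Expand the original features with degree-2 terms, then fit an
    # L2-regularised logistic regression with the strength chosen by CV.
    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LogisticRegressionCV
    from sklearn.pipeline import make_pipeline

    rng = np.random.RandomState(0)
    X, y = rng.randn(2000, 50), rng.randint(0, 2, 2000)     # placeholder data

    expanded_logreg = make_pipeline(
        PolynomialFeatures(degree=2, include_bias=False),   # 50 -> 1325 columns
        LogisticRegressionCV(Cs=10, cv=5, max_iter=2000),
    )
    expanded_logreg.fit(X, y)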
This is not exactly like a polynomial kernel, since I am talking about "intra-sample" feature products while polynomial kernels are about the inter-sample kernel matrix. Feature binning might help too: e.g. after centering your features, treat the positive and negative parts of each feature as independent features (one is zero while the other has the original value). In that case the bin size is 2; this can be generalized to larger sizes. We don't have any utility to do that in scikit-learn yet, but it should be easy enough to implement manually.
(Sep 13 '11 at 06:04)
ogrisel
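Since there is no built-in utility for it, here is a minimal manual sketch of the bin-size-2 split described above (my reading of the comment: after centering, each feature is replaced by its positive and negative parts):

    # Split each centered feature into its positive and negative parts, doubling
    # the number of columns -- the "bin size 2" case from the comment above.
    import numpy as np

    def sign_split(X):
        X = X - X.mean(axis=0)                    # center first
        return np.hstack([np.clip(X, 0, None),    # positive part, else 0
                          np.clip(X, None, 0)])   # negative part, else 0

    X = np.random.RandomState(0).randn(2000, 50)  # placeholder data
    print(sign_split(X).shape)                    # (2000, 100)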
Is there a kernel that corresponds to feature products? I like the idea of using feature products to model bilinear interactions, but I really don't like the quadratic growth in the number of features...
(Sep 13 '11 at 21:24)
digdug
1
@digdug: It's conveniently called the quadratic kernel, or degree 2 polynomial kernel. It does precisely what you think it does (i.e., has implicit features for all products of features).
(Sep 13 '11 at 22:19)
Alexandre Passos ♦
|
|
If the features are pretty noisy, you might want to at least consider ensemble learning methods as an alternative to a single classification algorithm. They tend to improve the classification accuracy of the underlying base learner in the presence of feature noise, as suggested in this paper and this one. One of the biggest downsides to ensemble learning methods is that they are rather susceptible to over-fitting your training set, so keep that in mind when choosing a suitable training/test split. Anyway, this is just a thought that might help improve some of your results, since you mentioned feature noise. |
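A sketch of the bagging route in scikit-learn terms (BaggingClassifier bags decision trees by default; the data and the number of estimators are placeholders):

    # Bag decision trees: averaging over bootstrap resamples tends to smooth out
    # the effect of noisy features and reduces variance compared to a single tree.
    import numpy as np
    from sklearn.ensemble import BaggingClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.RandomState(0)
    X, y = rng.randn(2000, 50), rng.randint(0, 2, 2000)   # placeholder data

    bagged_trees = BaggingClassifier(n_estimators=200, random_state=0)
    print(cross_val_score(bagged_trees, X, y, cv=5).mean())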
|
Oliver, I tried a similar binary classification problem on text, using both SVM- and logistic-regression-based classifiers. The data consisted of a particular relation and its contextual bag-of-words features. My data was sparse (on the order of 42K) and I had about 1.5M instances. My AUC score was 0.98, and I reached almost the same accuracy with both SVM and logreg. Some of the recent literature (see KYLIN, or Histogram of Gradients, which uses a similar idea) suggests this should not be a problem. However, as others pointed out, it would be interesting to change the kernel to RBF and see the changes. It would also be good to try some regularisation (it would be interesting to see the results on sparsity). Your training set seems too easy then, if you could get 0.98 AUC.
(Sep 28 '11 at 04:19)
Oliver Mitevski
|
Can you tell us a bit more about the nature of the data (how it was collected), or even make it available online, or is it a secret?
Have you tried doing manual annotation on a subsample to evaluate the difficulty of the task? Maybe you just have label noise.
I would like to share it, but that might not be possible (it's not mine). It's basically a gender dataset collected for internet users. The labels are for sure not noisy, but the features very much are. The features are topics that the users are reading. There's also the question of whether the assumption that these kinds of features are related to gender holds at all.
@Oliver: then it is highly likely that there is simply no way to reliably predict gender from this kind of data (as in, the Bayes error is nonzero), and using a higher-capacity model will just overfit.
@Alexandre, yes, that would be my conclusion for this classification problem, though I got some improvement by passing the features through tanh in order to spread them more evenly, because, as I mentioned, they were really concentrated around zero. So I'll look into more preprocessing; maybe it's not totally hopeless.
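For what it's worth, the tanh trick can be written as a small scikit-learn pipeline (a sketch; standardising before the squashing, and the placeholder data, are my assumptions):

    # Squash the zero-concentrated features with tanh to spread them out,
    # then fit logistic regression. Standardising first is an assumption.
    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler, FunctionTransformer
    from sklearn.linear_model import LogisticRegression

    rng = np.random.RandomState(0)
    X, y = rng.randn(2000, 50), rng.randint(0, 2, 2000)   # placeholder data

    tanh_logreg = make_pipeline(
        StandardScaler(),
        FunctionTransformer(np.tanh),
        LogisticRegression(max_iter=1000),
    )
    tanh_logreg.fit(X, y)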