
I am using Python's scikits.learn library with a linear SVM (SVC, since LinearSVC is even slower). I've heard libsvm, which scikits.learn interfaces with, is supposed to be very fast, but training on my 10000-sample, 10000-feature dense dataset takes almost an hour!

What can I do to speed this up, short of switching to a different language or (even worse) writing my own SVM implementation?

asked Jun 15 '11 at 01:54

Jacob Jensen


7 Answers:

Originally, libsvm and liblinear were designed to work with sparse representations. In scikit-learn, though, libsvm was patched to work directly with dense representations as well; this hasn't been done for liblinear. So, internally, scikit-learn always converts dense matrices to a sparse representation when using LinearSVC. This may well be the reason you see such a difference in performance.
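
Not part of the original answer, but a quick way to check which case applies is to measure how dense the data actually is, with plain numpy (X here stands in for the 10000 x 10000 training matrix):

import numpy as np

density = np.count_nonzero(X) / float(X.size)
print "%.1f %% of the entries are non-zero" % (100 * density)
# close to 100 %: truly dense, so SVC's dense-patched libsvm path avoids the copy;
# mostly zeros: a sparse matrix (and liblinear / LinearSVC) is the natural fit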

answered Jul 30 '11 at 12:18

Mathieu Blondel

Sweet! This also means that if I take the trouble to re-implement their algorithm myself things should work well, right?

(Jul 31 '11 at 02:19) Jacob Jensen

Possibly, yes, but give the SGD module in scikit-learn a shot too, as it can work with dense representations directly.

(Jul 31 '11 at 04:05) Mathieu Blondel

Regarding libsvm being patched but not liblinear: there is a patch here: http://www.cs.berkeley.edu/~smaji/projects/digits/ but I don't have a good dataset to test it on. I tried it on MNIST and saw no difference. Does anyone know this patch?

(Sep 18 '11 at 11:40) Andreas Mueller

If I'm not mistaken, MNIST is a sparse dataset (although not that sparse). It'd be better to try it on another dataset (denser and higher dimensional).

(Sep 18 '11 at 15:56) Mathieu Blondel

That's what I figured... I just didn't know of any suitable dataset in SVMLight format and thought maybe someone had some data lying around to try it out.

(Sep 18 '11 at 15:58) Andreas Mueller

Not an answer, but fwiw a couple of runtimes for scikits.learn libsvm / liblinear, just to give an idea of how they vary with kernel. ("The most important parameter is the one you haven't thought of.")

    # added 10 Aug, sgd waaay faster  --
 12 sec  sgd        mnist28 (10000, 784)  tol 0.1  C 1  penalty l2  correct 89.6 %
321 sec  LinearSVC  mnist28 (10000, 784)  tol 0.1  C 1  penalty l2  correct 86.6 %

  94 sec LinearSVC   mnist28 (5000, 784)  tol 0.01  C 1  correct 90.6 %
 341 sec SVC linear  mnist28 (5000, 784)  tol 0.01  C 1  correct 93.9 %
1353 sec SVC rbf     mnist28 (5000, 784)  tol 0.01  C 1  correct 84.3 %
     # added --
 174 sec LinearSVC   mnist28 (5000, 784)  tol 0.001  C 1  correct 90.6 %
 319 sec SVC linear  mnist28 (5000, 784)  tol 0.001  C 1  correct 93.9 %
4059 sec SVC rbf     mnist28 (5000, 784)  tol 0.001  C 1  correct 84.3 %

The data is MNIST from http://deeplearning.net/data/mnist/mnist.pkl.gz; train on the first 5k, test on the last 5k of the 70k vectors. My versions: Python 2.6.4, numpy 1.6.0, scikits.learn 0.8.1, Mac PPC.

http://peekaboo-vision.blogspot.com/2010/09/mnist-for-ever.html gets to 98 % -- rbf kernel, default tol .001 (LinearSVC .0001), runtimes unknown.

To see how runtimes vary with number of features, one could just use np.repeat( X, repeat, axis=1 ).
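
A tiny sketch of that idea (the factor 4 is arbitrary):

import numpy as np

X_wide = np.repeat(X, 4, axis=1)   # 784 features -> 3136, each column repeated 4 times
# refit the same classifier on X_wide and compare the wall-clock time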

Again fwiw, here's the code snippet. (It took me a while to realize that, while LinearSVC has multi_class=, SVC is automatically multiclass if y has > 2 values; but it's not clear which multiclass algorithm it then uses.)


if algo == "sgd":
    clf = linear_model.SGDClassifier( seed=seed )
elif algo == "LinearSVC":
    clf = svm.LinearSVC( multi_class=True, tol=tol, C=C )
        # penalty = "l1", dual = (penalty == "l2")
else:
    clf = svm.SVC( kernel=algo, tol=tol, C=C, gamma=gamma, shrinking=True )  # linear, rbf
fit = clf.fit( X, y )
ypred = fit.predict( Xtest )
correct = (ypred == ytest) .mean() * 100

t = "%.0f sec %s  %s %s  tol %.2g  C %.2g  correct %.1f %%" % (
    time() - t0, algo, source, X.shape, tol, C, correct)
print t
print >>tty, t
svmutil.pconfus( ytest, ypred, nclass )

answered Aug 09 '11 at 10:54

denis

edited Aug 10 '11 at 11:17

About my blog post: I used an RBF kernel, which I didn't mention explicitly, but I gave the gamma ;) (fixed now). The results don't really compare, though, since I used all 60000 training images for training, not only the first 5000. The tolerance was the standard tolerance of the libsvm command-line interface, which is 0.001. I don't know the runtime any more, but it was quite long, as you can imagine ;)

btw, SVC uses 1 vs rest multi-class classification afaik.

I find your results a little misleading, as far as the accuracy is concerned (the runtime comparison seems pretty reasonable). To compare different kernels in terms of accuracy I feel that you have to do grid searches to find optimal parameters.

(Aug 09 '11 at 11:01) Andreas Mueller

Andreas, sure, any timing results are misleading, incomplete, ymmv; I just wanted to give a rough, rough idea of the large range of runtimes and accuracies here. If some SVM (sgd?) is 10x faster, then that's the direction to go.

(Aug 09 '11 at 12:31) denis

Yeah, I (somewhat) agree. Liblinear should be way faster, which can be seen from your experiments... but the rbf SVM should be a lot better, and also a lot slower, than the linear one. And the "better" part is not so clear from your results ;)

(Aug 09 '11 at 12:45) Andreas Mueller

These are in line with my runtimes. At least I'm not doing something super-wrong.

(Aug 09 '11 at 16:39) Jacob Jensen

Jacob, clf = linear_model.SGDClassifier() looks waaay faster. If you were to ask about "experience with RFE", I'd put up some numbers on RFE with SGD too.

(Aug 12 '11 at 03:52) denis

I totally agree with Lucas that the dataset is pretty large. I strongly recommend drastically reducing the number of features.

Liblinear is one of the best SVM solvers out there. Compared to SGD it is really user-friendly: it does not require careful hyper-parameter tuning, but you need to keep some things in mind (see the LIBSVM guide). Most importantly: center your data, since SVMs are not scale invariant. (If you haven't done this yet, it should be the first thing you try!)
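
A minimal centering/scaling sketch with plain numpy (not from Peter's answer; X_train and X_test are hypothetical names, and the training-set statistics are reused for the test set):

import numpy as np

mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
std[std == 0] = 1.0                   # avoid dividing by zero for constant features
X_train = (X_train - mean) / std
X_test = (X_test - mean) / std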

Furthermore, did you use the default parameters for SVC or did you use kernel="linear"? The default kernel of SVC is non-linear (it's a poly kernel). If your data is non-linear it is plausible that SVC converges faster because LinearSVC has a hard time finding a reasonable linear fit (in this case use a smaller C parameter for LinearSVC).

Good luck!

best, Peter

answered Jul 19 '11 at 08:45

Peter Prettenhofer

The default kernel of scikits.learn.svm.SVC is not poly but rbf (which is slow too, and probably useless in very high dim).

(Jul 19 '11 at 14:33) ogrisel

I'm using libsvm with kernel="linear". I have tried liblinear and have not been satisfied with the results (worse and slower, for some reason). Currently I'm in a prototyping phase, so I'll suck it up and use libsvm, but I'll probably write my own SVM implementation later.

(Jul 19 '11 at 15:36) Jacob Jensen

Hey Peter, I have noticed too that liblinear is fairly robust in terms of setting the C parameter on many datasets. Is there a theoretical justification for this?

(Jul 30 '11 at 06:09) Gael Varoquaux

Peter, Gael, what do we have for big dense test cases, binary / multiclass?

(Jul 30 '11 at 10:19) denis

I do not know, this is a bit far from what I personally work on. I'd be interested by the answer though.

(Jul 31 '11 at 04:50) Gael Varoquaux

I'm not sure whether this switch is exposed in scikits, but liblinear is able to switch between optimising the primal and the dual. You should try that. Other than this, I don't think there is much that you can do, and I highly doubt that you can write an SVM that is faster than LibLinear. You might try stochastic gradient descent methods for linear classification; I haven't done much with them myself, but I think they should be faster. Cheers, Andy
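
As far as I can tell (an assumption, worth checking against your scikits.learn version), that switch is exposed as the dual argument of LinearSVC; a minimal sketch:

from scikits.learn import svm

# dual=True (the default) solves the dual problem; dual=False solves the primal,
# which tends to be faster when there are many more samples than features
clf = svm.LinearSVC(C=1.0, dual=False)
clf.fit(X, y)    # X, y: the dense training data and labels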

answered Jun 15 '11 at 03:53

Andreas Mueller

Clearly, go for a library that supports stochastic gradient descent.

(Jun 15 '11 at 19:52) levesque

Stochastic Gradient Descent is implemented in scikit-learn: http://scikit-learn.sourceforge.net/stable/modules/sgd.html; it's probably a good idea to try it.

(Jul 30 '11 at 06:07) Gael Varoquaux

Have you considered using their logistic regression implementation instead? The method is not too different, but it's optimized with SGD.

answered Jun 16 '11 at 05:23

Justin Bayer

It depends. They are different solvers for the various losses. Both the SVM problem and logistic regression can be solved using SGD (http://scikit-learn.sourceforge.net/modules/sgd.html); it's a matter of changing the loss.

(Jun 18 '11 at 11:27) Gael Varoquaux
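
A small sketch of the point in the comment above (data names are hypothetical; the loss argument picks the model):

from scikits.learn import linear_model

svm_like = linear_model.SGDClassifier(loss="hinge")    # hinge loss ~ linear SVM
logreg_like = linear_model.SGDClassifier(loss="log")   # log loss ~ logistic regression
svm_like.fit(X, y)
logreg_like.fit(X, y)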

Before switching to another library, you should try increasing the tolerance of the convergence criterion of your model. In scikit-learn this is usually a parameter called tol (see the documentation of LinearSVC, for instance).
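
For example (a sketch; the 1e-4 default for LinearSVC is the one quoted in the timings above):

from scikits.learn import svm

clf = svm.LinearSVC(C=1.0, tol=1e-2)   # looser than the 1e-4 default, so it stops earlier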

As Peter said: don't forget to scale your data. Have a look at the chapter on data preprocessing.

You can also try to reduce the dimensionality by transforming your data with:

from scikits.learn.pca import RandomizedPCA
pca = RandomizedPCA(n_components=500).fit(X_train)
X_train_pca = pca.transform(X_train)

# keep the pca instance around to be able to transform
# the test set the same way later
X_test_pca = pca.transform(X_test)

Then you can train your SVC, LinearSVC, SGDClassifier or LogisticRegression model on X_train_pca instead. If you want to use SVC with an rbf kernel (the default), it might be worth trying to whiten the data as well:

pca = RandomizedPCA(n_components=500, whiten=True).fit(X_train)
X_train_pca = pca.transform(X_train)

answered Jul 19 '11 at 14:43

ogrisel

edited Jul 19 '11 at 14:54

I have scaled the data, and I have tried messing with the tolerance. The problem is, loosening the tolerance significantly lowers accuracy in my case.

(Jul 19 '11 at 15:37) Jacob Jensen

Then dimensionality reduction is probably the way to go. You should also try SGDClassifier with penalty='elasticnet' but you will have to do some grid search for the learning rate.

(Jul 19 '11 at 16:07) ogrisel
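
A rough sketch of the suggestion in the comment above (not from the original comment; exact parameter names vary between versions, and alpha is used here as the knob being grid-searched; y_train is a hypothetical name):

from scikits.learn import linear_model
from scikits.learn.grid_search import GridSearchCV

params = {"alpha": [1e-5, 1e-4, 1e-3, 1e-2]}
clf = GridSearchCV(linear_model.SGDClassifier(penalty="elasticnet"), params)
clf.fit(X_train_pca, y_train)   # X_train_pca from the PCA step above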

I don't have much experience with dense data like that, but I'm not surprised that it takes a long time to run on a 100,000,000-entry dense matrix! You might try some feature selection to reduce the dimensionality of the input. In my experience, the feature selection in scikits.learn can run out of memory on inputs that large, though you can easily alter the code used here to run in an online fashion and improve its memory efficiency.
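
One simple, memory-light way to do that (a sketch with plain numpy rather than the scikits.learn feature-selection helpers; the cutoff of 500 features and the variable names are made up):

import numpy as np

var = X_train.var(axis=0)
keep = np.argsort(var)[-500:]         # indices of the 500 highest-variance features
X_train_small = X_train[:, keep]
X_test_small = X_test[:, keep]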

answered Jun 15 '11 at 19:18

Lucas Wiman
