Hi everybody. At the moment, online learning of linear classifiers seems to be a big topic. SGD in particular is well suited to large datasets where training time is a real concern.

That raises the question: what is the best SGD implementation out there? I am particularly interested in linear SVMs and logistic regression on dense data.

Is anyone aware of any wide comparison of different implementations?

There is Leon Bottou's sgd, John Langford's Vowpal Wabbit, Pegasos, Theano implementations, SGDClassifier in scikit-learn, and probably a lot more out there.

Also: which ones do you use and why? Is anyone aware of any good GPU implementations (which I guess would be somewhat trivially possible with Theano)?
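To make the scikit-learn option concrete, here is roughly how I would use SGDClassifier on dense data; loss='hinge' gives a linear SVM and loss='log' logistic regression. Exact parameter names depend on the scikit-learn version, so treat this only as a sketch with made-up data.

    # Illustrative only: the scikit-learn option from the list above, on dense data.
    import numpy as np
    from sklearn.linear_model import SGDClassifier

    X = np.random.randn(1000, 500)            # dense feature vectors
    y = np.random.randint(0, 2, size=1000)    # binary labels

    clf = SGDClassifier(loss='log')           # loss='hinge' for a linear SVM
    clf.fit(X, y)
    print(clf.score(X, y))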


asked Nov 23 '11 at 11:17


Andreas Mueller

edited Nov 23 '11 at 12:20

I think one problem with GPU implementations of online learning is that just parsing data as it arrives and shipping it off to GPU memory is much more expensive than the sparse dot product required to train any linear model by SGD.

(Nov 23 '11 at 12:07) Alexandre Passos ♦

That might be true for sparse models, but I'm not sure it holds for dense ones. I am particularly interested in the dense case; maybe I should have mentioned that. Computer vision applications often have dense vectors of 10k-50k dimensions.

(Nov 23 '11 at 12:19) Andreas Mueller

Not SGD but related: you might also want to consider "Online Passive Aggressive Algorithms" and "Confidence-Weighted Linear Classification".

(Nov 23 '11 at 13:22) Peter Prettenhofer

@Peter: I finally got around to reading the paper. Have you tried it? If so, what is your experience?

(Nov 28 '11 at 04:30) Andreas Mueller

2 Answers:

For logistic regression on dense data, a simple GPU implementation is quite fast, as long as you have minibatches of at least 128 cases or so and inputs of at least a couple hundred dimensions. Ideally you would also have a double-digit number of outputs. As a rule of thumb for GPU implementations on large datasets (which are often highly redundant), you should double the minibatch size whenever doing so increases the time cost per minibatch by a factor clearly less than 2, but this is only a guideline. Monitor your error as a function of wall-clock time and try a couple of reasonable minibatch sizes.
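A rough sketch of that doubling heuristic, under my reading of the answer; the helper `time_one_minibatch`, the starting size, and the 1.9 threshold are all hypothetical choices, not something the answer specifies:

    import time

    def time_one_minibatch(train_step, X, y, batch_size):
        # Hypothetical helper: run one SGD update on a batch of the given size
        # and return the wall-clock time it took.
        start = time.time()
        train_step(X[:batch_size], y[:batch_size])
        return time.time() - start

    def pick_minibatch_size(train_step, X, y, start_size=128, max_size=8192):
        size = start_size
        cost = time_one_minibatch(train_step, X, y, size)
        while size * 2 <= max_size:
            doubled_cost = time_one_minibatch(train_step, X, y, size * 2)
            # Keep doubling while a 2x larger batch costs clearly less than 2x
            # the time per minibatch; stop once the cost scales roughly linearly.
            if doubled_cost >= 1.9 * cost:
                break
            size, cost = size * 2, doubled_cost
        return size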

A simple (dense) logistic regression GPU implementation is trivial to produce if you already have one in Python using numpy: you can just use cudamat + gnumpy, or Theano if you prefer. I have no experience writing custom CUDA code for more highly optimized versions, since the simple Python GPU implementation I have already provides a large speedup for me, and I think it would be a waste of time to try to optimize it further even for a factor of 2 or 3 improvement.
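For concreteness, here is a minimal numpy sketch of the kind of dense minibatch logistic regression this refers to; swapping numpy arrays for gnumpy garrays (or porting to Theano) is what moves it to the GPU. The function names and hyperparameters are illustrative, not taken from the answer.

    import numpy as np

    def softmax(z):
        z = z - z.max(axis=1, keepdims=True)   # for numerical stability
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)

    def minibatch_sgd_logreg(X, y_onehot, n_epochs=10, batch_size=128, lr=0.1):
        # X: (n_samples, n_features) dense floats, y_onehot: (n_samples, n_classes).
        # With gnumpy, the numpy operations below run on the GPU instead.
        n, d = X.shape
        k = y_onehot.shape[1]
        W = np.zeros((d, k))
        b = np.zeros(k)
        for epoch in range(n_epochs):
            perm = np.random.permutation(n)
            for start in range(0, n, batch_size):
                idx = perm[start:start + batch_size]
                Xb, yb = X[idx], y_onehot[idx]
                probs = softmax(Xb.dot(W) + b)
                # Gradient of the average cross-entropy loss over the minibatch.
                err = (probs - yb) / len(idx)
                W -= lr * Xb.T.dot(err)
                b -= lr * err.sum(axis=0)
        return W, b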


answered Nov 23 '11 at 16:05


gdahl ♦

Thanks for sharing your experience. I wrote an implementation using my lab's CUV, which I think of as a more convenient cudamat. I benchmarked it against scikit-learn and it lost. It might be that I used minibatches that were too small. How big are the datasets you use? I tried CIFAR-10, which has 60,000 examples in 10 classes, with features of 13k dimensions. How did you arrive at your rule of thumb about doubling the minibatch size? Doesn't it take many more passes through the dataset if you use such big batches compared to true online learning?

And maybe one more technical question: does your cudamat implementation do memory transfers asynchronously or does it wait for the batch to be copied to the GPU?

(Nov 24 '11 at 02:53) Andreas Mueller

Also: Do you use higher learning rates with bigger batches?

(Nov 24 '11 at 03:54) Andreas Mueller

Since gradient descent approaches are not scale invariant, I measure learning rates based on the average gradient over the minibatch, so what I think of as the learning rate stays constant as I change the minibatch size. In other words, the update is of the form (learningRate / minibatchSize) * sum(gradient for each case in the minibatch).
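In code, the convention described in that comment looks roughly like this (a sketch; the function and variable names are illustrative):

    import numpy as np

    def sgd_update(W, case_gradients, learning_rate):
        # case_gradients has shape (batch_size,) + W.shape: one gradient per case.
        # Dividing by the batch size keeps the effective step constant when the
        # minibatch size changes, as described in the comment above.
        batch_size = case_gradients.shape[0]
        return W - (learning_rate / batch_size) * case_gradients.sum(axis=0)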

(Nov 24 '11 at 16:08) gdahl ♦

Are you comparing to scikits.learn? Does it use minibatches of size 1? What minibatch size did you use in your GPU implementation? In terms of minibatches per second, I think logistic regression with minibatches of size 128, input size 13,000, and output size 10 should be clearly faster on the GPU, even with a simple gnumpy/cudamat implementation. In terms of time needed to reach error rate X, it may or may not be faster than something that does pure stochastic training, although for many models and problems it can be much faster in my experience.

(Nov 24 '11 at 16:25) gdahl ♦

Thanks for your answer. On the minibatch size: that's how I think of the learning rate, too. I was wondering whether you change the constant. Some people have the intuition that larger batches give a more stable gradient, which would justify a higher learning rate.

Yes, I compared to scikit-learn. It does true online learning, i.e. minibatch size 1.

I used 64 and, after your comment, tried values up to 8000. By "faster" I meant faster to reach the same error rate, which I think is what counts. Clearly, computing the updates for minibatches of size 128 is faster on the GPU. As you said, whether "learning" (reaching the same error) is faster is another question.

It's good to know that you had positive experiences there. Have you ever measured the speedup to reach a given error X?

(Nov 25 '11 at 02:57) Andreas Mueller

Incidentally, there is a paper at the current NIPS on this topic, "Better Mini-Batch Algorithms via Accelerated Gradient Methods" by Cotter et al. They use @Andreas' criterion of measuring runtime until a certain loss is attained. I'm not so sure about their results, though, since some of their magic seems to lie in the learning-rate schedule and they do not compare to e.g. Bottou's implementation.

answered Dec 20 '11 at 06:09


Hannes S
