
What are the advantages/disadvantages of batch vs. stochastic gradient descent? When should you use one vs. the other?

asked May 01 '12 at 17:14

grautur


4 Answers:

Stochastic gradient descent has a convergence rate that is independent of the size of your dataset, which makes it well suited to huge or even infinite datasets. There are two downsides to it:

  • its slow convergence rate, which is O(1/t). This is usually not a problem in a learning context as the minimax rate on unseen data (usually referred to as "test error") is also O(1/t). Thus, it is not beneficial to get a faster convergence rate on the training set as this will not be translated into a better convergence rate on the test set. Léon Bottou has some papers on the topic, for instance http://leon.bottou.org/papers/bottou-bousquet-2011.

  • its sensitivity to its two parameters, the learning rate and the decrease constant (a minimal sketch of such a schedule follows this list). This issue has not yet been solved, except by some heuristics. Batch methods, on the other hand, can use additional tools to make learning faster, such as line searches. A great toolbox which implements all these tools for batch methods is minFunc by Mark Schmidt: http://www.di.ens.fr/~mschmidt/Software/minFunc.html .
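To make that second point concrete, here is a minimal sketch of plain SGD on a least-squares objective using the two parameters in question, an initial learning rate and a decrease constant; the function and parameter names are illustrative, not taken from the answer or from minFunc.

    import numpy as np

    def sgd_least_squares(X, y, eta0=0.1, decrease_const=1e-3, n_epochs=10, seed=0):
        """Plain SGD on 0.5/n * ||Xw - y||^2 with the decayed step size
        eta_t = eta0 / (1 + decrease_const * t)."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w = np.zeros(d)
        t = 0
        for _ in range(n_epochs):
            for i in rng.permutation(n):
                eta = eta0 / (1.0 + decrease_const * t)  # decayed learning rate
                grad_i = (X[i] @ w - y[i]) * X[i]        # gradient on one example
                w -= eta * grad_i
                t += 1
        return w

Tuning eta0 and decrease_const by hand, or by trying a small grid on held-out data, is exactly the sensitivity this bullet point refers to.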

answered May 01 '12 at 18:48

Nicolas Le Roux

edited May 02 '12 at 02:39


I would like to add that you should almost NEVER use plain full-batch gradient descent. It is a terrible algorithm. One exception would be if you have a very small dataset, in which case it doesn't even matter.

(May 02 '12 at 02:09) gdahl ♦

One point that I'd like to add to Nicolas's excellent answer is that the argument for stochastic gradient descent rests mostly on learning aspects, as opposed to optimization aspects.

If you really want to optimize a function (i.e. the training error), batch GD (usually with additional refinements such as line search or second-order methods) is preferable.

When you want to learn, i.e. minimize the expected loss on unseen data, then SGD really is a good idea, as argued by Léon Bottou and by Nicolas in his first bullet point.

(May 02 '12 at 03:08) Andreas Mueller

If you look at your optimization as a dynamic process, you could say:

  • Batch gradient descent: you need to run over every training example before doing a single update, which means that with a large dataset you may spend a long time before making any progress.

  • Stochastic gradient descent, on the other hand, performs an update every time it sees a training example. However, since each update is based on a single example, it may never converge exactly, although it can still get very close to the minimum.

In conclusion, if your data is large, you might want to use stochastic gradient descent (a sketch of both update loops follows below).

Andrew Ng explained it better in this set of notes (page 5)
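To make the contrast concrete, here is a rough sketch of the two update loops on a least-squares objective; the function names and step sizes are illustrative.

    import numpy as np

    def batch_gd(X, y, eta=0.01, n_iters=100):
        """One update per full pass over the data: the gradient uses every example."""
        w = np.zeros(X.shape[1])
        for _ in range(n_iters):
            grad = X.T @ (X @ w - y) / len(y)  # full-dataset gradient
            w -= eta * grad
        return w

    def sgd(X, y, eta=0.01, n_epochs=10, seed=0):
        """One update per example: each step is cheap but noisy."""
        rng = np.random.default_rng(seed)
        w = np.zeros(X.shape[1])
        for _ in range(n_epochs):
            for i in rng.permutation(len(y)):
                w -= eta * (X[i] @ w - y[i]) * X[i]  # single-example gradient
        return w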

answered May 01 '12 at 23:22

Leon Palafox ♦

I think Nicolas has a great answer. To build on it: in general, you can recover some of the benefits of batch gradient descent by using a small collection of data points (a mini-batch) to estimate your gradient direction.

There are many papers on this. http://arxiv.org/pdf/1012.1367.pdf is a recent example of how it is used in distributed systems.
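A minimal sketch of that mini-batch idea, again on a least-squares objective; the batch size and names are illustrative.

    import numpy as np

    def minibatch_sgd(X, y, batch_size=32, eta=0.01, n_epochs=10, seed=0):
        """Each update estimates the gradient from a small random subset of
        examples, trading per-step cost against gradient noise."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(n_epochs):
            idx = rng.permutation(n)
            for start in range(0, n, batch_size):
                b = idx[start:start + batch_size]
                grad = X[b].T @ (X[b] @ w - y[b]) / len(b)  # mini-batch gradient
                w -= eta * grad
        return w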

answered May 02 '12 at 03:47

zaxtax ♦


We also recently proposed a stochastic method which enjoys the convergence rate of batch methods, i.e. exponential. The caveat is that you need to be able to fit all your gradients in memory, but you can use minibatches so that you only store one gradient per minibatch. We plan to provide simple ways to do line searches with this method: http://arxiv.org/abs/1202.6258

(May 02 '12 at 03:57) Nicolas Le Roux
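For readers who want the flavour of the method Nicolas mentions, here is a rough, simplified sketch of a stochastic-average-gradient-style update on a least-squares loss. It only illustrates the memory trade-off described above; the step size is a placeholder, and the paper itself gives the principled step-size choices and the convergence analysis.

    import numpy as np

    def sag_least_squares(X, y, eta=0.01, n_iters=10000, seed=0):
        """Keep the most recent gradient seen for every example, refresh one of
        them per iteration, and step along the average of the stored gradients.
        The (n, d) table of stored gradients is the memory caveat in question."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w = np.zeros(d)
        stored = np.zeros((n, d))   # one remembered gradient per example
        grad_sum = np.zeros(d)      # running sum of the stored gradients
        for _ in range(n_iters):
            i = rng.integers(n)
            g_new = (X[i] @ w - y[i]) * X[i]
            grad_sum += g_new - stored[i]
            stored[i] = g_new
            w -= eta * grad_sum / n  # step along the average stored gradient
        return w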


I think the best technique is to use mini-batches with a varying batch size B, increasing the batch size as learning progresses. This is similar to an annealing schedule in simulated annealing.
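One possible way to implement such a schedule, sketched on a least-squares objective with an arbitrary doubling rule; all names and constants here are illustrative, not something prescribed by the answer.

    import numpy as np

    def growing_minibatch_gd(X, y, eta=0.01, batch_size=8, growth=2,
                             n_stages=5, steps_per_stage=100, seed=0):
        """Mini-batch gradient descent whose batch size B grows over training,
        so early updates are cheap and noisy and later ones approach full batch."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(n_stages):
            for _ in range(steps_per_stage):
                b = rng.choice(n, size=min(batch_size, n), replace=False)
                grad = X[b].T @ (X[b] @ w - y[b]) / len(b)  # current-batch gradient
                w -= eta * grad
            batch_size *= growth  # anneal the batch size upward
        return w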

answered Nov 21 '12 at 11:44

Paul Donnelly

