What are the advantages/disadvantages of batch vs. stochastic gradient descent? When should you use one vs. the other?
Stochastic gradient descent has a convergence rate that is independent of the size of your dataset and is thus well suited when you have a huge or even infinite dataset. There are two downsides to it:
I would like to add that you should almost NEVER use plain full-batch gradient descent. It is a terrible algorithm. One exception would be if you have a very small data set so it doesn't even matter.
(May 02 '12 at 02:09)
gdahl ♦
One point that I'd like to add to Nicolas' excellent answer is that the argument for stochastic gradient descent mostly builds on learning aspects, in contrast to optimization aspects. If you really want to optimize a function (i.e. the training error), batch GD (usually with additional fanciness like line search or second-order methods) is preferable. When you want to learn, i.e. minimize the expected loss on unseen data, then SGD really is a good idea, as argued by Leon Bottou and by Nicolas in his first bullet point.
(May 02 '12 at 03:08)
Andreas Mueller
If you look at your optimization as a dynamic process, you could say:
In conclusion, if your data is large, you might want to use Stochastic Gradient Descent. Andrew Ng explained it better than I can in this set of notes (page 5)
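For concreteness, here is a minimal sketch contrasting the two update rules discussed in this thread, on a made-up least-squares problem in plain NumPy (the data, step sizes, and iteration counts are arbitrary illustrations, not taken from any of the answers):

```python
import numpy as np

# Toy least-squares problem, invented purely for illustration.
rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def full_gradient(w):
    # Gradient of (1/2n) ||Xw - y||^2: touches all n examples.
    return X.T @ (X @ w - y) / n

def example_gradient(w, i):
    # Gradient of the loss on the single example i.
    return X[i] * (X[i] @ w - y[i])

# Batch gradient descent: one update per full pass, each costing O(n d).
w_batch = np.zeros(d)
for _ in range(100):
    w_batch -= 0.1 * full_gradient(w_batch)

# Stochastic gradient descent: one update per example, each costing O(d),
# so the per-update cost does not depend on the dataset size.
w_sgd = np.zeros(d)
for t in range(100 * n):
    i = rng.integers(n)
    step = 0.1 / (1 + t / n)      # decaying step size to damp the noise
    w_sgd -= step * example_gradient(w_sgd, i)
```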
I think Nicolas has a great answer. To build on it: in general you can recover some of the benefits of Batch Gradient Descent by using a small collection of data points to estimate your gradient direction. There are many papers on this. http://arxiv.org/pdf/1012.1367.pdf is a recent example of how it is used in distributed systems.
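As a rough sketch of the mini-batch idea described above (same toy least-squares setup as before, sizes chosen arbitrarily): a small batch gives a gradient estimate pointing roughly in the direction of the full gradient, at a small fraction of the cost.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, batch_size = 100_000, 10, 64     # arbitrary illustrative sizes
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)

def gradient(w, idx):
    # Least-squares gradient estimated from the rows listed in `idx`.
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / len(idx)

w = np.zeros(d)
full = gradient(w, np.arange(n))                        # exact gradient, O(n d)
mini = gradient(w, rng.integers(n, size=batch_size))    # estimate, O(B d)

# The mini-batch estimate typically points in roughly the same direction as
# the full gradient while touching only 64 of the 100,000 examples.
cosine = full @ mini / (np.linalg.norm(full) * np.linalg.norm(mini))
print(f"cosine similarity between full and mini-batch gradient: {cosine:.2f}")
```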
We also recently proposed a stochastic method which enjoys the convergence rate of batch methods, i.e. exponential. The caveat is that you need to be able to fit all your gradients in memory, but you can use minibatches to store only one gradient per minibatch. We plan to provide simple ways to do line searches with this method: http://arxiv.org/abs/1202.6258
(May 02 '12 at 03:57)
Nicolas Le Roux
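For readers curious what the stochastic-average-gradient idea referenced in the comment above (http://arxiv.org/abs/1202.6258) roughly looks like, here is a minimal sketch under my own assumptions: a toy least-squares loss, an arbitrary step size, and one stored gradient per minibatch as the comment suggests. It illustrates the idea, not the paper's algorithm verbatim.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, batch_size = 10_000, 5, 100       # arbitrary illustrative sizes
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)

# One stored gradient per minibatch (rather than per example), as suggested
# in the comment, to keep the memory cost manageable.
blocks = np.array_split(np.arange(n), n // batch_size)
m = len(blocks)

def block_gradient(w, idx):
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / len(idx)

w = np.zeros(d)
stored = np.zeros((m, d))      # last gradient seen for each minibatch
grad_sum = np.zeros(d)         # running sum of the stored gradients
lr = 0.1                       # arbitrary step size for this toy problem

for _ in range(50 * m):
    j = rng.integers(m)                  # pick a minibatch at random
    g = block_gradient(w, blocks[j])     # recompute its gradient at the current w
    grad_sum += g - stored[j]            # keep the running sum up to date
    stored[j] = g
    w -= lr * grad_sum / m               # step along the average of stored gradients
```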
I think the best technique is to use mini-batches with a varying batch size B: increase the batch size as learning progresses. This is similar to an annealing schedule in simulated annealing.
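A sketch of what such a schedule might look like (my own illustration, reusing the toy least-squares setup; the starting batch size and the doubling rule are arbitrary choices, not a recommendation from the comment):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)

def gradient(w, idx):
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / len(idx)

w = np.zeros(d)
lr = 0.1
B = 16                                   # start with small, cheap, noisy batches

for epoch in range(15):
    for _ in range(n // B):
        idx = rng.integers(n, size=B)
        w -= lr * gradient(w, idx)
    # Grow the batch size as training progresses: later updates average over
    # more examples and are less noisy, loosely like lowering the temperature
    # in a simulated-annealing schedule.
    B = min(2 * B, n)
```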