How important is the batch size for SGD? A batch of size > 1 is favorable for a concurrent implementation, but are there any other benefits or drawbacks of small or large batch sizes? For the non-concurrent case, is there any reason to prefer a batch of size > 1? Does this also hold for the regularized case?

asked Nov 04 '10 at 12:00

yoavg


One Answer:

Very large batch sizes have been reported to slow convergence (I forget the publication), for reasons similar to why SGD often works better than batch gradient descent. On the other hand, batches give less noisy gradient estimates than single-data-point updates and are, of course, computationally more efficient when the work can be run in parallel. The batch sizes I see most often range from 10 to 100 data points, but this may depend heavily on the type of model you are optimizing. As with finding a good learning rate, it is unfortunately often a matter of trial and error.
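
To make the trial-and-error concrete, here is a minimal sketch of minibatch SGD in which the batch size is a plain parameter you can vary. Everything in it is an assumption for illustration only: the toy linear-regression problem, the names `minibatch_sgd` and `make_data`, and the learning rate and epoch count are not from any particular publication or library.

```python
import random


def minibatch_sgd(xs, ys, batch_size, lr=0.05, epochs=50, seed=0):
    """Fit y = w*x + b by minibatch SGD on squared error (illustrative only)."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        # Reshuffle each epoch so batches differ between passes.
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Average the per-example gradients over the batch.
            grad_w = sum((w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum((w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


def make_data(n=200, seed=1):
    """Hypothetical toy data: y = 3x + 0.5 plus a little Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [3.0 * x + 0.5 + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys


if __name__ == "__main__":
    xs, ys = make_data()
    # batch_size=1 is plain SGD; batch_size=len(xs) is batch gradient descent.
    for batch_size in (1, 10, 100, len(xs)):
        w, b = minibatch_sgd(xs, ys, batch_size)
        print(batch_size, round(w, 3), round(b, 3))
```

Note that with the number of epochs held fixed, a larger batch makes fewer parameter updates per pass over the data (here `len(xs) / batch_size` updates per epoch), so comparing batch sizes fairly usually means retuning the learning rate alongside them.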

answered Nov 04 '10 at 12:46

Philemon Brakel


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.