|
How important is the batch size for SGD? A batch of size > 1 is favorable for a concurrent implementation, but are there any other benefits or drawbacks of small or large batch sizes? For the non-concurrent case, is there any reason to prefer a batch of size > 1? Does this also hold for the regularized case? |
|
Very large batch sizes have been shown to lead to slower convergence (I forget which publication showed this), for reasons similar to why SGD often works better than batch gradient descent. On the other hand, larger batches give less noisy updates than single-data-point updates, and are of course computationally more efficient when things can be run in parallel. Most often I see batch sizes ranging from 10 to 100 data points, but this may depend heavily on the type of model you are optimizing. As with finding a good learning rate, it is unfortunately often a matter of trial and error. |
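To make the trial-and-error part concrete, here is a minimal sketch of mini-batch SGD in plain Python, with the batch size as a parameter you can vary. Everything in it is an assumption for illustration rather than something from the question: the toy 1-D linear regression problem, the names `minibatch_sgd` and `make_data`, and the learning rate and epoch count. It is not a recommendation for any particular model.

```python
import random


def minibatch_sgd(xs, ys, batch_size, lr=0.05, epochs=50, seed=0):
    """Fit y = w * x + b by mini-batch SGD on the mean squared error.

    Hypothetical helper for illustration; lr and epochs are arbitrary choices.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the data in a new random order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Average the per-example gradients over the batch:
            # larger batches average more terms, so each update is less noisy.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


def make_data(n=200, seed=1):
    """Toy data: y = 3x + 0.5 plus Gaussian noise (assumed for this sketch)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [3.0 * x + 0.5 + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys


if __name__ == "__main__":
    xs, ys = make_data()
    # Same number of epochs for every batch size, so larger batches
    # get fewer parameter updates in total.
    for batch_size in (1, 10, 100, len(xs)):
        w, b = minibatch_sgd(xs, ys, batch_size)
        print(f"batch_size={batch_size:3d}  w={w:.3f}  b={b:.3f}")
```

With the number of epochs held fixed, a batch of the full data set makes only one update per epoch, while a batch of 10 makes twenty. On a toy problem like this, that difference alone can leave the full-batch run further from the true parameters. Whether the same holds for your model is exactly what you have to find out by experiment.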