Is there any literature that examines the choice of minibatch size when performing stochastic gradient descent? In my experience, it seems to be an empirical choice, usually found via cross-validation or using varying rules of thumb.

Is it a good idea to slowly increase the minibatch size as validation error decreases? What effects would this have on generalization error? Am I better off using an extremely small minibatch and updating my model hundreds of thousands of times? Would I be better off with a size somewhere between extremely small and full batch? Should I scale the size of my minibatch with the size of the dataset, or with the expected number of features in the dataset?

I obviously have a lot of questions about implementing minibatch learning schemes. Unfortunately, most papers I read don't really specify how they chose this hyperparameter. I've had some success with advice from authors such as Yann LeCun, especially in the Tricks of the Trade collection of papers. However, I still haven't seen these questions fully addressed. Does anyone have recommendations for papers, or advice on what criteria I can use to determine good minibatch sizes when trying to learn features?

asked Aug 27 '13 at 02:31 by Phox

4 Answers:

IMO the point of a minibatch is that you are estimating the "population" error surface with a sample, so you should think about things like the variance of your weight updates. Think of taking the mean of some quantity: to get a good estimate, you need more samples if your data has high variance than if it has low variance. Equally, the point of a minibatch is that a sample average is normally good enough (law of large numbers), i.e. you save computation time that can be used for more gradient descent steps. Remember the standard error: the standard deviation of the average is 1/sqrt(N) times the standard deviation of the individual values, so quadrupling the number of points in your minibatch only halves the standard error. You might therefore want to look at a metric relating the absolute value of your mean (batch) update to the standard error of that mean update. For example, if your mean update is (0.01, 0.1) and its standard error is (1, 0.02), you might want to increase your batch size, because the first coordinate's estimate is 0.01 +/- 1.
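
A minimal sketch of that check, assuming per-example gradients are available; the gradient_signal_to_noise helper and the toy numbers are hypothetical illustrations, not something from the answer:

    import numpy as np

    def gradient_signal_to_noise(per_example_grads):
        """Compare the minibatch mean gradient to its standard error.

        per_example_grads: array of shape (batch_size, n_params),
        one gradient vector per example in the minibatch.
        """
        n = per_example_grads.shape[0]
        mean_grad = per_example_grads.mean(axis=0)
        # Standard error of the mean shrinks like 1/sqrt(n):
        # quadrupling the batch only halves it.
        std_err = per_example_grads.std(axis=0, ddof=1) / np.sqrt(n)
        return mean_grad, std_err

    # Toy numbers mirroring the example in the answer: coordinate 0 has a
    # tiny mean update relative to its noise (about 0.01 +/- 1), coordinate 1 does not.
    rng = np.random.default_rng(0)
    grads = rng.normal(loc=[0.01, 0.1], scale=[8.0, 0.16], size=(64, 2))
    mean_grad, std_err = gradient_signal_to_noise(grads)
    print(mean_grad, std_err)  # a std_err that dwarfs mean_grad suggests a larger batch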

answered Aug 27 '13 at 05:22 by SeanV

That's a good explanation. Also, since the initial parameters are usually far from optimal, you can use small minibatches to get close to the optimum (with less computation), and then larger minibatches later for fine-tuning.

(Aug 27 '13 at 13:45) Daniel Hammack

To fit Daniel's comment within my general statement: you typically only care about the relative error of your gradient calculation. Around a minimum, the (mean) gradient is small, so you will need many samples to get a relative error of, say, 1%; away from a (local) minimum it is not small, so to keep a relative error of 1% you need fewer samples.

(Aug 28 '13 at 03:12) SeanV

I think it is a bit too much of a cop-out to just say it is an empirical question. For training neural nets, when there is a good amount of training data, I think it is more of a computational question. Generally, try to use the smallest minibatch that takes maximal advantage of the parallelism in the hardware you are using and the model you are training. If the hardware you are running on makes an update with a minibatch size of 2X take almost the same time as an update with a minibatch size of X, then use the larger minibatch. If you are using BLAS operations on the GPU or CPU, you want to be doing something that has the performance characteristics of a matrix-matrix multiply, not a matrix-vector multiply. When training a neural net on a single CPU or GPU, I increase the minibatch size until the computation becomes the bottleneck and use the smallest batch size that achieves this.

However, when you have a small training set and you really want to get to a training error minimum (I find I am rarely in this situation) you might want to use larger minibatches. Usually I want to use the smallest minibatch possible that isn't wasting compute power.
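
A rough way to find that saturation point is to time a representative update at several batch sizes. The sketch below just times the matrix-matrix multiply a fully-connected layer would perform; the sizes and the time_update helper are illustrative assumptions, not part of the answer:

    import time
    import numpy as np

    def time_update(batch_size, n_in=1024, n_out=1024, repeats=20):
        # Stand-in for the dominant cost of one update: a
        # (batch_size x n_in) by (n_in x n_out) matrix-matrix multiply.
        X = np.random.randn(batch_size, n_in).astype(np.float32)
        W = np.random.randn(n_in, n_out).astype(np.float32)
        start = time.perf_counter()
        for _ in range(repeats):
            X @ W
        return (time.perf_counter() - start) / repeats

    for b in [1, 8, 32, 128, 512]:
        t = time_update(b)
        # Once doubling the batch roughly doubles the time per update, the
        # hardware is saturated; smaller batches were not using it fully.
        print(f"batch={b:4d}  time/update={t * 1e3:.3f} ms  time/example={t / b * 1e6:.1f} us")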

answered Aug 29 '13 at 00:12 by gdahl ♦

I think the consensus among practitioners is that minibatches often help, they help by a different amount in different stages of learning, and the best minibatch size is something which should be determined empirically, as it's a function of the learning rates and many other parameters which are also tuned.

answered Aug 27 '13 at 14:35 by Alexandre Passos ♦

Are there any heuristics that can help with choosing a range of minibatch sizes before using a parameter search? SeanV pointed out a few considerations. I'm wondering if you have anything more to add.

(Aug 27 '13 at 21:52) Phox

@Phox: Take a subsample! The qualitative behaviour of the error surface doesn't change much with the number of samples, so do your empirical investigations on a subset small enough for the search to be practical (as recommended by Bottou, http://cilvr.cs.nyu.edu/doku.php?id=courses:bigdata:slides:start, or his "Tricks of the Trade ... Reloaded" material).

(Sep 01 '13 at 15:17) SeanV

In the case of convex objectives, there has been some theoretical analysis of the effect of using a sequence of increasing batch sizes on the convergence rate of the method. When you are far from the solution, a small mini-batch is sufficient to make fast progress towards the solution, but as you approach the solution you need to increase the mini-batch size (i.e. reducing the variance of the estimator, or decreasing the error in the gradient) to continue making fast progress. Unfortunately, how to choose the sequence of batch sizes in practice is still somewhat of an art, even in this simplified setting.

See:
Friedlander and Schmidt, "Hybrid Deterministic-Stochastic Methods for Data Fitting", SIAM Journal on Scientific Computing.
Byrd et al., "Sample Size Selection in Optimization Methods for Machine Learning", Mathematical Programming.
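
As a toy illustration of the increasing-batch-size idea (a sketch only, not the schedules analyzed in the papers above), here is SGD on a least-squares problem where the minibatch grows geometrically; the starting size, growth factor, and step size are arbitrary assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 10_000, 20
    A = rng.normal(size=(n, d))
    x_true = rng.normal(size=d)
    b = A @ x_true + 0.1 * rng.normal(size=n)

    x = np.zeros(d)
    step = 0.05
    batch_size = 8                      # small batches suffice far from the solution
    for it in range(200):
        idx = rng.choice(n, size=min(batch_size, n), replace=False)
        grad = A[idx].T @ (A[idx] @ x - b[idx]) / len(idx)
        x -= step * grad
        if it % 20 == 19:
            batch_size *= 2             # hypothetical geometric growth schedule
    print(np.linalg.norm(x - x_true))   # error keeps shrinking as the batch grows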

answered Sep 01 '13 at 00:43 by Mark Schmidt (edited Sep 01 '13 at 00:46)
