As mentioned in the comments on this question, it's important to randomize the order of training data when using stochastic gradient descent.

Now suppose you're taking multiple passes over the data, i.e. using SGD as an external memory learning algorithm. How much additional value is gotten out of using a different randomization on each pass? There's some value, since you avoid the danger of latching on to some idiosyncratic pattern in the first randomization. But is it worth the cost of rerandomizing, particularly for a big data set where that means additional disk accesses?

I've fooled around a bit with this, and there seems to be a small value, but I'm wondering if anyone knows of any literature on this?

asked Dec 06 '10 at 12:22

Dave%20Lewis's gravatar image

Dave Lewis
785162644

edited Dec 06 '10 at 12:34

The main issue is that you don't want to encounter too many similar data points in a row. So if you are classifying digits you don't want to present them like 111122222233333 etc. Unless you have many labels that appear very infrequently, I don't think the algorithm will be sensitive to seeing the same order twice as the model will probably be in a different region of weight space between two passes over the dataset. Especially if it is so large that reshuffling it would be costly. Unfortunately I don't know about literature that addresses this issue...

(Dec 06 '10 at 14:46) Philemon Brakel

If you wanted to try to tackle the cost directly, you could have a separate thread running in the background, if possible with a lower priority, that's preparing the training set for the next run; but you end up storing the set twice for a simple implementation.

That could be optimized, though. Supposing it was an on-line algo, the background thread could periodically work on re-distributing samples across mini-batches that have already been used for the current epoch.

If you're using Matlab, and the algo is on-line, I put the mini-batch arrays in cells. Randomly accessing elements in cells is a lot faster than extracting a portion of a larger array to use as a mini-batch.

-Brian

(Dec 06 '10 at 16:10) Brian Vandenberg

I did find it worthwhile to shuffle the data between successive rounds of online algorithms. If shuffling is very expensive, though, you might save some of the cost by keeping the order of examples but skipping over a different random subset of it in each round.

(Dec 06 '10 at 18:20) yoavg

@Brian, @yoavg - Nice efficiency ideas there!

(Dec 07 '10 at 10:22) Dave Lewis

One Answer:

Léon Bottou seems to have covered (somewhat) this question in an unpublished note called Curiously Fast Convergence of some Stochastic Gradient Descent Algorithms (Bottou, 2009). The gist of his experiment is that:

  • cycling through a pre-shuffled dataset gives better than linear convergence,
  • shuffling the dataset before each pass is more chaotic, but still better-looking than the thing one is "supposed" to do, which is
  • sample each example randomly (with replacement) from the dataset during training.

Note that this is all applicable to training errors and that as far as I know we have no theory for analyzing the first two cases.

This answer is marked "community wiki".

answered Dec 15 '10 at 14:56

Dumitru%20Erhan's gravatar image

Dumitru Erhan
10669

Very interesting paper - thanks! So it looks like shuffling is slightly better than cycling through, but less stable. The near-quadratic convergence with both shuffling and cycling is indeed, as Bottou says, curious! I find myself wondering if stochastic gradient in some accidental way ends up acting like a limited-memory quasi-Newton algorithm.

(Dec 15 '10 at 20:32) Dave Lewis

I have the same experience. Interestingly, for a non-convex problem I worked on recently, results where actually much better with pre-shuffling compared to shuffling. My intuition was that when you pre-shuffle you ensure that the temporal distance between two consecutive appearance of any particular instance is maximized, so you minimize the risk of moving too much in some potentially harmful direction. I don't know of any theoretical results. It seems to me that this would be highly dependent on the specific error surface.

(Dec 16 '10 at 13:09) Oscar Täckström

I wonder if what Dumitru said is true of other algorithms as well, eg, backprop (with/without 2nd order methods), kernel methods, etc.

(Dec 16 '10 at 13:53) Brian Vandenberg
1

Brian: backpropagation is just an algorithm for computing the gradient with respect to some training data, not a particular optimization algorithm. All of Doomie's comments should apply to any system that learns by optimizing an objective function (kernel methods are a little trickier in this respect, as it would have to be something that's easy to optimize in the primal, which limits your kernel choices considerably).

(Dec 16 '10 at 15:22) David Warde Farley ♦
1

It's not particularly hard to verify that Léon's experiments can be reproduced with other gradient-based methods / models / datasets. I might even do it over the holidays :). The real question (for me) is whether the cycling/shuffling methods will lead to more or faster over-fitting. Also, with the datasets that we're currently interested in, we never really go twice through the training set, so things like Yoshua Bengio's curriculum learning idea (ordering examples from easier to more difficult in an online learning scenario) might be more interesting.

(Dec 16 '10 at 15:36) Dumitru Erhan
Your answer
toggle preview

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.