So, I am trying to distribute a massive structured prediction problem. I am treating this as an online learning problem where I adjust the weights as I encounter training points. I have a large number of data points and want to distribute the training across a cluster. I recall reading a paper that says I can partition my training data, train each batch separately and average the weights. Is this actually true? If this is true, are there any papers which include a proof of why this works? The closest paper I found was Efficient LargeScale Distributed Training of Conditional Maximum Entropy Models by Mann et al asked Jun 26 '11 at 20:26 zaxtax ♦ 
You certainly can do this. It's true for essentially the same reason that stochastic gradient descent works. Each piece of data approximates the same gradient, averaging (or taking these gradients consecutively with small step sizes, e.g. SGD) will give you a better approximation of the true gradient. You should make sure (at whatever interval is computationally efficient) that your cluster doesn't stray too far out of sync (i.e. different batches start finding different local minima). Just do this by averaging all of them a few times through the learning process instead of just once at the end. answered Jun 26 '11 at 22:06 Jacob Jensen Great, do you know a paper with a proof showing averaging gives a better approximation?
(Jun 26 '11 at 23:45)
zaxtax ♦
You might find this useful: http://martin.zinkevich.org/publications/nips2010.pdf
(Jun 27 '11 at 10:26)
Justin Bayer
Well, there's this paper: http://www.icml2011.org/papers/398_icmlpaper.pdf by Welling and Teh but I feel like there should be something older with the result. Bengio et al also do something where they stagger each processor and then just pass the parameters normally (which results in a pipeline that is effectively the same as regular batch learning). I would also look into the stuff done by Vowpal Wabbit, a library designed for massive scale (and presumably parallel) learning. I may be wrong when I said this was basically trivial in my post.
(Jun 27 '11 at 10:28)
Jacob Jensen

also have a look here: http://www.ryanmcd.com/papers/parallel_perceptronNAACL2010.pdf for the structured perceptron case. answered Jun 28 '11 at 02:51 yoavg 