So, I am trying to distribute a massive structured prediction problem. I am treating it as an online learning problem, adjusting the weights as I encounter training points. I have a large number of data points and want to distribute the training across a cluster. I recall reading a paper claiming that I can partition my training data, train a separate model on each partition, and then average the resulting weight vectors. Is this actually true? If so, are there any papers that include a proof of why this works?
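
For concreteness, here is a minimal sketch of the parameter-averaging scheme I mean, using a plain binary perceptron on toy data (the `train_perceptron` / `parameter_mixing` names and the toy dataset are my own illustration, not from the paper I half-remember):

```python
import numpy as np

def train_perceptron(X, y, epochs=10):
    # Simple online perceptron: update weights on each misclassified point.
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:
                w += yi * xi
    return w

def parameter_mixing(X, y, n_shards=4, epochs=10):
    # Partition the data, train independently on each shard, average the weights.
    shards = np.array_split(np.arange(len(y)), n_shards)
    weights = [train_perceptron(X[idx], y[idx], epochs) for idx in shards]
    return np.mean(weights, axis=0)

# Toy linearly separable data: label is the sign of x0 + x1.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

w_avg = parameter_mixing(X, y)
acc = np.mean(np.sign(X @ w_avg) == y)
```

The question is whether this single round of train-then-average is guaranteed to behave like (or approximate) training on the full dataset, or whether something like iterating the mix-train-mix cycle is needed.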