|
I have this simple idea that is probably pretty idiotic, but I fail to see the problems with it and would appreciate it if someone pointed them out. Let's say I have a massive amount of data and I want to fit a linear regression model to it. I am okay with having an approximate model; the only thing I care about is being sure that my accuracy increases as I add more data. I understand that Andrew Ng's paper "Map-Reduce for Machine Learning on Multicore" is one way of doing this, but say I do not want to go to the trouble of implementing a new algorithm, so I am willing to pay a penalty in accuracy to have something that uses my current solver.

Here is my idea for a poor man's linear regression on big data: say we have 10 machines we can use. I split my training set into 10 pieces, send each slice to a machine, and fit a regression line to it. In my reduce phase I find the average model (basically the regression line found by taking the angle bisector between each pair of lines). I would like to argue that this new regression line will fit my test data better than each individual regression line that was learned from a small subset of the data. What are the problems with this model? (A minimal sketch of the idea is below.)
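To make this concrete, here is a minimal sketch of what I have in mind; the synthetic data, the shard count of 10, and helper names like `fit_ols` are just for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "big" dataset: y = X @ w_true + noise (purely illustrative).
n, d = 100_000, 5
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)

def fit_ols(X_part, y_part):
    """Ordinary least squares on one shard (stands in for one machine)."""
    w, *_ = np.linalg.lstsq(X_part, y_part, rcond=None)
    return w

# "Map": split into 10 shards and fit a regression on each.
shards = np.array_split(np.arange(n), 10)
local_models = [fit_ols(X[idx], y[idx]) for idx in shards]

# "Reduce": average the coefficient vectors to get the averaged line.
w_avg = np.mean(local_models, axis=0)

print("averaged model error:    ", np.linalg.norm(w_avg - w_true))
print("worst single-shard error:",
      max(np.linalg.norm(w - w_true) for w in local_models))
```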
|
I can't give an analysis of your proposal, but I can suggest some pointers. I would suggest formalizing this as a random projection or randomized linear algebra problem; the splitting is in fact a random projection. In this formalism you can check whether your algorithm gives a good approximation (a minimal sketch-and-solve illustration follows the pointers below). Specifically, two pointers (metaoptimize questions with associated discussions):
|
|
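To illustrate the random-projection viewpoint, here is a toy sketch-and-solve example of my own (it is not taken from the linked discussions; the problem sizes, sketch dimension, and names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative least-squares problem; k is the sketch size, with k << n.
n, d, k = 20_000, 5, 1_000
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# Gaussian sketching matrix S (k x n): S @ X compresses the problem.
S = rng.normal(size=(k, n)) / np.sqrt(k)

# Solve the small sketched problem min ||S X w - S y||^2 instead of the full one.
w_sketch, *_ = np.linalg.lstsq(S @ X, S @ y, rcond=None)

# Compare against the exact full solve.
w_full, *_ = np.linalg.lstsq(X, y, rcond=None)
print("sketched vs exact coefficient gap:", np.linalg.norm(w_sketch - w_full))
```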
So you're right that the quality of the averaged regression line is at least as good as the average quality of the individual regression lines (that is just Jensen's inequality). However, doing what you describe is not always a good idea for harder problems (for easier problems, on the other hand, each machine may well have enough data to compute a very accurate solution on its own, so it won't matter much). See McDonald et al., "Distributed Training Strategies for the Structured Perceptron", for a discussion of the problems with this kind of averaging. In theory and in practice, however, running two or three passes, where in each pass every machine optimizes a proximal problem, x_t = argmin_x ||A x - b||^2 + lambda * ||x - x_{t-1}||^2 on its own data, and the results are then averaged, gives very fast convergence to a small error (it is a poor man's method of multipliers). A sketch of this multi-pass averaging is below.
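Here is a rough sketch of that multi-pass proximal averaging on a synthetic least-squares problem split across 10 "machines"; the shard count, lambda, and data are illustrative assumptions, not a definitive implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, n_machines, lam = 50_000, 5, 10, 1.0
x_true = rng.normal(size=d)
A = rng.normal(size=(n, d))
b = A @ x_true + 0.1 * rng.normal(size=n)
shards = np.array_split(np.arange(n), n_machines)

def prox_solve(A_i, b_i, x_prev, lam):
    """Closed-form minimizer of ||A_i x - b_i||^2 + lam * ||x - x_prev||^2."""
    dim = A_i.shape[1]
    return np.linalg.solve(A_i.T @ A_i + lam * np.eye(dim),
                           A_i.T @ b_i + lam * x_prev)

x = np.zeros(d)
for t in range(3):                        # two or three passes
    # Each "machine" solves its local proximal problem around the current x.
    local_solutions = [prox_solve(A[idx], b[idx], x, lam) for idx in shards]
    # Reduce phase: average the local solutions.
    x = np.mean(local_solutions, axis=0)
    print(f"pass {t + 1}: error = {np.linalg.norm(x - x_true):.4f}")
```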