Hi,

I have big datasets for which I want to compute least-squares solutions. The problem is big in all directions: many samples (50,000), many features (2,000), and many regressors (1,000). I was wondering what the right numerical approaches are for this problem. In particular, given that truncated SVDs can be computed efficiently by randomized projection (code, metaoptimize discussion), I was wondering whether there are tricks to solve the least-squares problem directly with randomized projections.


asked Nov 24 '10 at 10:27


Gael Varoquaux

What about Partial Least Squares to solve the regression problem at hand?

(Nov 25 '10 at 02:36) osdf

Yeah. Putting the scalability issues aside, I think PLS or Canonical Correlation Analysis (CCA) could actually be a good idea since there are many regressors (I guess "regressors" means responses). Since it is quite likely that the multiple responses are related to each other, this is really a multitask/multilabel/multi-response setting. For such cases, CCA can be used to extract a subspace shared between the features X and the responses Y, and the model can then be trained using features from this subspace. See for example Extracting shared subspace for multilabel classification and Multi-label prediction via sparse infinite CCA for the multilabel (i.e., discrete responses) setting; the same idea applies to regression (i.e., continuous responses) problems too.

Also, as for the scalability issue, I don't think applying something like PLS or CCA at the scale you mention (2,000 features, 1,000 responses, 50k examples) should be a problem.

(Nov 25 '10 at 02:54) spinxl39

Thanks for all the suggestions. I must do a bit of reading before giving more input.

However, with regards to PLS or CCA: the reason that I need the regression is actually to compute the residuals in a sparse PCA framework (or dictionary learning), of which CCA is pretty much a special case. CCA would probably be just as expensive as the regression, as it requires whitening the data beforehand. I still haven't wrapped my head around PLS, as I can't understand the criterion that it optimizes.

It turns out that in the meantime I found that I really want to regularize my regression a tiny bit, as without any regularization it runs into instabilities. But the papers Alex and Spinxl pointed out seem to give a good framework for using a ridge.

(Nov 25 '10 at 03:19) Gael Varoquaux
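For reference, a minimal sketch of a closed-form ridge solution for this multi-target setting (NumPy only; the regularization strength alpha is an arbitrary illustrative value, not something recommended in the thread):

```python
import numpy as np

def ridge_lstsq(X, Y, alpha=1e-3):
    """Closed-form ridge: W = (X^T X + alpha * I)^{-1} X^T Y.

    Y may hold many target columns at once, matching the
    many-regressors setting of the question.
    """
    d = X.shape[1]
    A = X.T @ X + alpha * np.eye(d)      # regularized (d, d) Gram matrix
    return np.linalg.solve(A, X.T @ Y)   # (d, n_targets) coefficients
```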

My point behind using CCA is that you can use it for doing supervised dimensionality reduction using both the features X and the responses Y. So basically you first run CCA on the pair X and Y. Once you have the dimensionality-reduced features (say Z) from CCA, training the regression model on the (Z, Y) pair instead of the (X, Y) pair can be advantageous: the data dimensionality is reduced, and the new features Z are expected to have better predictive power (because you have also considered the responses Y in computing the low-dimensional projection of the data). Besides, working in the lower-dimensional space also acts as a form of regularization.

(Nov 25 '10 at 03:39) spinxl39
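A minimal sketch of this CCA-then-regress pipeline, assuming scikit-learn's CCA and Ridge estimators; the toy data and the number of components are purely illustrative:

```python
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.linear_model import Ridge

# Toy data standing in for X (n_samples, n_features) and Y (n_samples, n_targets).
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 200))
Y = X[:, :50] @ rng.standard_normal((50, 30)) + 0.1 * rng.standard_normal((1000, 30))

cca = CCA(n_components=20)            # supervised projection using both X and Y
Z = cca.fit(X, Y).transform(X)        # reduced features Z
model = Ridge(alpha=1e-3).fit(Z, Y)   # regress the responses on Z instead of X
```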

2 Answers:

Although not a random-projections-based approach, the paper Sampling Algorithms for l2 Regression and Applications describes a randomized sampling scheme that selects a good representative set of examples and solves the problem using this smaller set.
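A rough sketch of the sampling idea; uniform row sampling is used here as a crude stand-in for the leverage-score-based probabilities analyzed in the paper, and the subset size is arbitrary:

```python
import numpy as np

def subsampled_lstsq(X, y, n_rows=5000, rng=None):
    """Solve least squares on a random subset of the rows of (X, y)."""
    rng = np.random.default_rng(rng)
    idx = rng.choice(X.shape[0], size=min(n_rows, X.shape[0]), replace=False)
    # Uniform sampling only illustrates the shape of the approach; the paper
    # samples rows with probabilities tied to their statistical leverage.
    w, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    return w
```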

Another paper, Faster Least Squares Approximation, presents a preconditioning approach that selects a smaller set of constraints (akin to random sampling) to solve the problem over. They also have a sparse-random-projections-based algorithm somewhat along the lines of the Sarlos reference Alexandre pointed to in his answer.


answered Nov 24 '10 at 11:23


spinxl39

Hey Spinxl, I am marking your answer as the 'accepted answer' as I thought that the paper 'Faster Least Squares Approximation' was a really excellent read.

I have managed to reorganize my code a bit, so I don't need to compute that least-squares estimate as often. In addition, a very simple technique of precomputing the pseudo-inverse and applying it to slices of the target data was enough to decrease the computation time a lot in my case. Thus I won't be implementing any of these ideas in the short term. :)
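For reference, a minimal sketch of that precompute-and-slice trick (NumPy only; the slice width is arbitrary):

```python
import numpy as np

def lstsq_by_slices(X, Y, chunk=100):
    """Precompute pinv(X) once, then apply it to column slices of Y."""
    pinv = np.linalg.pinv(X)                 # (n_features, n_samples), computed once
    W = np.empty((X.shape[1], Y.shape[1]))
    for start in range(0, Y.shape[1], chunk):
        # Each slice of targets reuses the same precomputed pseudo-inverse.
        W[:, start:start + chunk] = pinv @ Y[:, start:start + chunk]
    return W
```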

Thank you, however, for all the answers (Alexandre and Spinxl), as they give a lot of food for thought.

(Nov 25 '10 at 16:40) Gael Varoquaux

I've never looked carefully at it either, but if you're willing to assume that (using Wikipedia's notation) the data matrix X has low rank, then you can compute the pseudoinverse of X^T X quickly: take a randomized-projection truncated SVD of X, invert the vector of singular values (replace each entry s of sigma by 1/s), and remultiply to get the pseudoinverse (X^T X)^+, which you can then multiply by X^T y to get the coefficient vectors.
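A rough sketch of that low-rank shortcut, using scikit-learn's randomized_svd; the rank k is an assumption about the data, and the code applies V diag(1/s) U^T y directly, which equals (X^T X)^+ X^T y on the retained subspace:

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd

def lowrank_lstsq(X, y, k=100, seed=0):
    """Approximate least squares assuming X has numerical rank ~ k."""
    # Randomized truncated SVD: X ~= U diag(S) VT.
    U, S, VT = randomized_svd(X, n_components=k, random_state=seed)
    # Coefficients: V diag(1/S) U^T y == (X^T X)^+ X^T y on the rank-k subspace.
    return VT.T @ (np.diag(1.0 / S) @ (U.T @ y))
```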

The paper A fast randomized algorithm for overdetermined linear least-squares regression, by Rokhlin and Tygert, seems to present an algorithm that's provably correct for full-rank matrices, which is not true of the one I presented above. Their algorithm essentially applies a fast randomized Fourier transform to the data matrix X, computes a pivoted QR decomposition of the result (returning Q, R, and a permutation pi), uses this to build the preconditioner P = R pi, and solves the preconditioned least-squares problem (minimizing ||X P^-1 z - b|| over z) with the conjugate gradient method, which converges faster because X P^-1 has a good condition number with high probability.
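A loose sketch of that sketch-and-precondition idea, with simplifications that are mine rather than the paper's: a dense Gaussian sketch stands in for the randomized Fourier transform, plain (unpivoted) QR for the pivoted QR, SciPy's LSQR for conjugate gradients, and the sketch size 4*d is just a common heuristic:

```python
import numpy as np
from scipy.linalg import qr, solve_triangular
from scipy.sparse.linalg import LinearOperator, lsqr

def preconditioned_lstsq(X, b, sketch_size=None, seed=0):
    """Solve min ||X w - b|| for a single target column b."""
    n, d = X.shape
    sketch_size = sketch_size or 4 * d
    rng = np.random.default_rng(seed)
    S = rng.standard_normal((sketch_size, n)) / np.sqrt(sketch_size)
    # QR of the sketched matrix; X @ inv(R) is well conditioned with high probability.
    _, R = qr(S @ X, mode='economic')
    matvec = lambda z: X @ solve_triangular(R, z)                # X R^{-1} z
    rmatvec = lambda u: solve_triangular(R, X.T @ u, trans='T')  # R^{-T} X^T u
    A = LinearOperator((n, d), matvec=matvec, rmatvec=rmatvec)
    z = lsqr(A, b)[0]                 # iterative solve of min ||X R^{-1} z - b||
    return solve_triangular(R, z)     # w = R^{-1} z
```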

The paper Improved Approximation Algorithms for Large Matrices via Random Projections by Sarlos presents a seemingly faster algorithm. Let S be a random N x K matrix (where N is the number of data points and K is a certain truncation value); if you solve exactly the sketched problem min_w ||S^T X w - S^T b||, you get a good enough approximation to the original solution. (This can be made faster by using a better matrix-multiplication algorithm and a better pseudoinverse algorithm based on random projections, but I'm a bit fuzzy on the details, as this is quite a theoretical paper.)
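A toy version of that sketch-and-solve step; a dense Gaussian S is used here, whereas the paper gets its speed from fast, structured projections, and the sketch size k is arbitrary:

```python
import numpy as np

def sketched_lstsq(X, y, k=2000, seed=0):
    """Solve the sketched problem min_w ||S^T X w - S^T y|| exactly."""
    rng = np.random.default_rng(seed)
    S = rng.standard_normal((X.shape[0], k)) / np.sqrt(k)   # random N x K matrix
    # The sketched system has only k rows, so the exact solve is cheap.
    w, *_ = np.linalg.lstsq(S.T @ X, S.T @ y, rcond=None)
    return w
```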


answered Nov 24 '10 at 10:38


Alexandre Passos ♦

edited Nov 24 '10 at 10:51
