I am working with large matrices and would like to model the output $Y = X\beta + \epsilon$.

Here Y is an m x n matrix, X is an m x k matrix, and beta is a k x n matrix. I know the matrices Y and X and would like to find the matrix beta. The standard linear regression solution requires inverting $X^T X$, but the matrix I have is very big, something like 23000 x 50000. How should one go about finding beta if we do not use the exact formula?

Thanks

asked Dec 11 '12 at 05:08

Shishir Pandey


One Answer:

What method to use depends on several factors:

  • how sparse X and Y are
  • the relative sizes of k, m & n
  • what distributions the data have in each of the columns of X & Y

Presumably 23000 x 50000 is the X matrix, meaning there are more features than training samples (k > m), so you will want some form of regularization. If n is also large (i.e. Y is relatively wide) then partial least squares would be appropriate, and you can regularize by the dimension of the intermediate space. If n is small (narrow Y) then you can treat the problem as n separate regression problems, and since k > m you would want to use some form of iterative method like Pegasos/SGD or LARS rather than the explicit formula. Pegasos/SGD is probably preferable for sparse X and LARS for dense X. These iterative algorithms also provide built-in regularization.

There are free implementations of all of the above methods: if you work with Python they are all provided in the scikit-learn package, and for R you can find them on CRAN.
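As a minimal sketch of the column-by-column approach, assuming scikit-learn (the sizes and variable names here are illustrative stand-ins, not the real data):

    import numpy as np
    from scipy.sparse import random as sparse_random
    from sklearn.linear_model import SGDRegressor

    m, k, n = 1000, 5000, 10          # stand-ins for the real 23000, 50000, 313
    X = sparse_random(m, k, density=0.001, format="csr", random_state=0)
    Y = np.random.randn(m, n)

    beta = np.zeros((k, n))
    for j in range(n):
        # L2-penalized SGD is in the same family as Pegasos; the penalty
        # term is the built-in regularization mentioned above.
        model = SGDRegressor(penalty="l2", alpha=1e-4, max_iter=20, tol=1e-3)
        model.fit(X, Y[:, j])
        beta[:, j] = model.coef_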

Another option, in the case of a narrow Y, is to perform some form of dimension reduction on X and then do ordinary linear regression on the reduced X using the explicit formula.
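A sketch of that pipeline, assuming scikit-learn's TruncatedSVD for the reduction (sizes again illustrative):

    import numpy as np
    from scipy.sparse import random as sparse_random
    from sklearn.decomposition import TruncatedSVD

    m, k, n, d = 1000, 5000, 10, 50   # d = dimension of the reduced space
    X = sparse_random(m, k, density=0.001, format="csr", random_state=0)
    Y = np.random.randn(m, n)

    # TruncatedSVD works directly on sparse matrices (unlike plain PCA).
    Z = TruncatedSVD(n_components=d, random_state=0).fit_transform(X)  # m x d
    # The explicit least-squares formula is now cheap: a d x d system, not k x k.
    beta_reduced, residuals, rank, sv = np.linalg.lstsq(Z, Y, rcond=None)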

If enough of the columns in X or Y have unusual (non-Gaussian) distributions you may need some form of robust regression. In particular, if Y has a long-tailed distribution (i.e. outliers), Laplace regression would be appropriate. Laplace regression can be incorporated modularly into the iterative algorithms since it is just a change of loss function.
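For instance, in scikit-learn's SGDRegressor the epsilon-insensitive loss with epsilon set to 0 reduces to the absolute-error (Laplace) loss, so the change really is just the loss argument (the synthetic data below is purely illustrative):

    import numpy as np
    from sklearn.linear_model import SGDRegressor

    rng = np.random.RandomState(0)
    X = rng.randn(200, 20)
    true_beta = rng.randn(20)
    y = X @ true_beta + 0.1 * rng.standard_cauchy(200)   # heavy-tailed noise

    # epsilon_insensitive with epsilon=0 is exactly the absolute-error
    # (Laplace) loss; everything else matches the least-squares setup.
    laplace = SGDRegressor(loss="epsilon_insensitive", epsilon=0.0,
                           penalty="l2", alpha=1e-4, max_iter=1000, tol=1e-3)
    laplace.fit(X, y)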

answered Dec 11 '12 at 08:22

Daniel Mahler

edited Dec 11 '12 at 11:10

Yes, 23000 x 50000 is the X matrix. 0.1% of X entries are known. 0.2% of Y entries are known. Y is a narrow matrix - 23000 x 313.

(Dec 11 '12 at 13:44) Shishir Pandey

Re: "0.1% of X entries are known, 0.2% of Y entries are known": does that mean that the remaining fields are unknown/missing rather than just 0? That would make it a different kind of problem.

313 columns in Y is somewhere in between what I think of as wide and narrow. Given that X is very sparse, doing multiple runs of Pegasos seems reasonable. Partial least squares may also be good; it may work better given that X and Y are both sparse, but I am not really an expert on PLS (see the sketch below).

(Dec 11 '12 at 16:27) Daniel Mahler
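A minimal PLS sketch along those lines (illustrative sizes; note that scikit-learn's PLSRegression expects dense arrays, so a very sparse X may need to be densified or reduced first):

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    m, k, n = 500, 2000, 313          # illustrative, not the full dataset
    X = np.random.randn(m, k)         # PLSRegression expects dense arrays
    Y = np.random.randn(m, n)

    # n_components is the dimension of the intermediate space, which
    # acts as the regularizer.
    pls = PLSRegression(n_components=50)
    pls.fit(X, Y)
    Y_hat = pls.predict(X)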

In both cases the remaining entries are 0. I am using the rcv1v2 (industries) dataset from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html

(Dec 11 '12 at 16:50) Shishir Pandey

That is really a classification rather than a (linear) regression problem. You can model it with linear regression, but I would not expect it to work well. LARS, SGD and Pegasos can be used for classification as well. Given that you have a lot of classes and a large sparse input matrix, you could try a 3-layer network with a relatively narrow middle layer for regularization. The Winnow algorithm should also do well; SNoW is a nice implementation of Winnow (a sketch of the basic update is below).

(Dec 11 '12 at 17:33) Daniel Mahler
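A from-scratch sketch of the basic Winnow update for a single 0/1 label (this is the plain algorithm, not SNoW itself, and assumes a dense 0/1 feature matrix for clarity):

    import numpy as np

    def winnow_train(X, y, alpha=2.0, epochs=5):
        """Basic Winnow for a 0/1 feature matrix X (m x k) and 0/1 labels y."""
        m, k = X.shape
        w = np.ones(k)
        theta = k / 2.0                          # conventional threshold
        for _ in range(epochs):
            for i in range(m):
                pred = 1 if X[i] @ w >= theta else 0
                if pred == 0 and y[i] == 1:      # false negative: promote
                    w[X[i] == 1] *= alpha
                elif pred == 1 and y[i] == 0:    # false positive: demote
                    w[X[i] == 1] /= alpha
        return w

    # One weight vector per label column of Y:
    # weights = [winnow_train(X, Y[:, j]) for j in range(Y.shape[1])]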

I tried the SGDRegressor and SGDClassifier in scikit-learn; they do not take Y as a matrix. What do you suggest?

(Dec 12 '12 at 04:18) Shishir Pandey

SGDClassifier will take a vector of the class labels as integers. Make sure you represent X as a scipy.sparse.csr_matrix rather than a regular numpy matrix (see the sketch below).

(Dec 12 '12 at 10:47) Daniel Mahler
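A sketch of that setup, fitting one SGDClassifier per label column (sizes and data are illustrative):

    import numpy as np
    from scipy.sparse import csr_matrix
    from sklearn.linear_model import SGDClassifier

    m, k, n = 1000, 5000, 5           # illustrative sizes
    rng = np.random.RandomState(0)
    X = csr_matrix(rng.binomial(1, 0.001, size=(m, k)).astype(float))
    Y = rng.binomial(1, 0.3, size=(m, n))     # 0/1 label-indicator matrix

    # One classifier per label column, each fit on a 1-D integer label vector.
    classifiers = []
    for j in range(n):
        clf = SGDClassifier(loss="hinge", penalty="l2", alpha=1e-4, max_iter=20)
        clf.fit(X, Y[:, j])
        classifiers.append(clf)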