
I am interested in using L1 and other sparse methods for feature selection for SVMs.

When the number of features is much larger than the number of samples, i.e. when p >> N, L1-regularized or other sparse methods generally do a better job, and they have embedded feature selection. But when many features are highly correlated, the variance in performance and in the selected features is generally large, because only one of a group of correlated variables tends to be chosen. I was wondering if the following would work better:

1.) Train a sparse/L1 classifier.

2.) Get all the features with non-zero coefficients/weights.

3.) Form a correlation matrix of the dataset C, a pxp matrix of correlation coefficients.

4.) For each feature selected in step 1, also select the features from C that are significantly correlated with it (above a certain threshold).

Is this a good strategy to try out? I know about other methods like the elastic net, which also has an L2 term to give weight to correlated variables, but I was wondering whether there is any credence to this simpler approach.
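In case it helps to be concrete, here is a rough sketch of steps 1-4 in Python with scikit-learn. The synthetic data, the regularization strength C, and the correlation threshold are placeholders I made up, not values I am recommending:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy data in the p >> N regime (placeholder values).
rng = np.random.default_rng(0)
N, p = 50, 500
X = rng.standard_normal((N, p))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# 1) Train a sparse (L1-penalized) linear SVM.
clf = LinearSVC(penalty="l1", dual=False, C=0.1, max_iter=5000).fit(X, y)

# 2) Features with non-zero weights.
selected = np.flatnonzero(clf.coef_.ravel())

# 3) p x p correlation matrix of the features.
C = np.corrcoef(X, rowvar=False)

# 4) Also keep features whose |correlation| with a selected feature exceeds a threshold.
threshold = 0.8
expanded = set(selected)
for j in selected:
    expanded.update(np.flatnonzero(np.abs(C[j]) >= threshold))

print(sorted(expanded))
```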

asked Mar 05 '12 at 07:28

Tosif Ahamed


2 Answers:

A few points:

First, if you have p >> N, you cannot reliably estimate the correlation matrix. I don't think it is useful to try to embed it in the feature selection method.

Second, for highly correlated variables, L1 tends not to work well. This is related to the recovery problem in compressive sensing; you can have a look at my answer to an earlier question. Basically, there are two different problems:

  1. When several relevant variables are highly correlated, L1 will select one of them arbitrarily, leaving out the others

  2. If the relevant variables are correlated to irrelevant variables, L1 will have a hard time singling out the right variables

Another point to keep in mind is that L1 by itself can select at most N variables (as many as you have samples). If you have highly correlated variables, chances are that you want to select more.

Variations of L1 can partly address these problems:

  1. ElasticNet, which you mention. What the elastic net really brings to the problems of L1 is that it mitigates problem 1 above. See Jia, J. and Yu, B., "On model consistency of the elastic net when p >> n".

  2. RandomizedLasso: re-estimating the sparse linear model under perturbations of the data and the design matrix. This method addresses problem 2 above. See Meinshausen, N. and Bühlmann, P., "Stability selection".
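For illustration only, here is a minimal hand-rolled sketch of the randomization idea behind the randomized lasso / stability selection. It is not the exact procedure of the paper; the subsampling fraction, the reweighting range, and the number of resamples below are arbitrary:

```python
import numpy as np
from sklearn.linear_model import Lasso

def selection_frequencies(X, y, alpha=0.05, n_resamples=100,
                          subsample=0.5, weights=(0.5, 1.0), seed=0):
    """Fraction of resamples in which each feature gets a non-zero Lasso weight,
    under random subsampling of the rows and random rescaling of the columns."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_resamples):
        rows = rng.choice(n, size=int(subsample * n), replace=False)
        w = rng.uniform(*weights, size=p)            # random column reweighting
        model = Lasso(alpha=alpha, max_iter=5000).fit(X[rows] * w, y[rows])
        counts += model.coef_ != 0
    return counts / n_resamples
```

Features with a high selection frequency across resamples are the stable ones; the perturbations mitigate the arbitrariness with which plain L1 picks among correlated variables.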

These methods, in my experience, do not solve both problems together, and they have the drawbacks of bringing in additional metaparameters that are difficult to set and of taking significantly more computational power than plain L1.

Actually, in many, many situations, simple univariate testing gives better results than sparse methods for feature selection. See for instance Haury, A-C., Gestraud, P., and Vert, J-P., "The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures".
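As a point of comparison, univariate selection is essentially a one-liner in scikit-learn; the synthetic data and the value of k below are placeholders:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 500))                 # N = 50 samples, p = 500 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Keep the k features with the strongest univariate F-test association with y.
selector = SelectKBest(score_func=f_classif, k=20).fit(X, y)
univariate_features = selector.get_support(indices=True)
print(univariate_features)
```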

Edit: The cat is out of the bag: we have addressed this problem in specific settings in some work published at ICML: feature clustering + sparsity is a good solution, as long as you randomize. Please cite us if you use this idea successfully.

answered Mar 05 '12 at 09:34

Gael Varoquaux

edited Aug 03 '12 at 04:15

Thank you very much for the response and the links. You mention that the correlation matrix cannot be estimated reliably. The corrcoef function in MATLAB also outputs a matrix of p-values, so I can choose only those features whose correlations are statistically significant. I tried it out and there seems to be some improvement, but I am not sure whether it is real or just chance.

The univariate methods certainly seem to be doing a better job currently, but I would prefer a multivariate method. Are there any other multivariate feature selection methods that I should try out?

(Mar 05 '12 at 10:02) Tosif Ahamed

Thresholding the covariance matrix doesn't help estimate it. On the contrary, it makes the matrix non-positive definite, and thus leads to a nonsensical definition of a covariance.

The problem is that the covariance matrix is a multivariate object. The juxtaposition of univariate estimates (bivariate correlations, for instance) does not help in estimating the multivariate structure of the covariance matrix, which is really what we are interested in here. For instance, the eigenvector structure would give us the right loadings for defining sets of correlated variables. But a good estimation of this structure is hard.

It is possible to do a better job at estimating the covariance matrix by imposing a sparse prior on its inverse, which is a principled way of achieving some sort of thresholding (on the inverse, though, not on the covariance itself, as the inverse encodes partial correlations, and these are more likely to be sparse). However, in my experience, such a fancy estimation scheme simply adds noise in a regression setting, and will not help your feature selection in general.
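If you want to try it anyway, scikit-learn has a cross-validated sparse inverse covariance estimator; this is just a sketch of the call on toy data, and it will struggle when p is much larger than n:

```python
import numpy as np
from sklearn.covariance import GraphicalLassoCV

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))    # toy data; n > p here on purpose

# Sparse prior on the inverse covariance (precision) matrix: the penalty is
# chosen by cross-validation, off-diagonal precision entries are shrunk to zero.
model = GraphicalLassoCV().fit(X)
precision = model.precision_          # sparse-ish inverse covariance (partial correlations)
covariance = model.covariance_        # implied covariance estimate
```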

If you really want to try out fancy stuff, you could try using random forests for feature selection http://scikit-learn.org/dev/modules/feature_selection.html#tree-based-feature-selection. I don't see why random forests would be relevant in a linear regression setting though.
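For completeness, the tree-based selection from that link looks roughly like this; the synthetic data and the number of trees are placeholders:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 500))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Rank features by impurity-based importances from randomized trees, then keep
# those whose importance exceeds the (default: mean) threshold.
forest = ExtraTreesClassifier(n_estimators=200, random_state=0)
selector = SelectFromModel(forest).fit(X, y)
tree_selected = selector.get_support(indices=True)
print(tree_selected)
```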

The question that comes to my mind is: why do you want so much to do multivariate feature selection? Can you show that in practice it brings anything to the table? There is an unfortunate bias toward mathematical sophistication, and I have more than once seen multivariate methods published on data where univariate methods actually work much better.

(Mar 05 '12 at 15:05) Gael Varoquaux

Thank you once again. So it seems I am headed the wrong way then. I am actually dealing with fMRI data and I want to choose stable features that generalize over a population. Univariate methods are not performing satisfactorily in that case. It was just my intuition that since multivariate methods look at interactions between features, they might choose more stable and generalizable features. I am experimenting with things currently and trying out everything I can. So in a way I just want to see if and to what degree multivariate methods help.

(Mar 05 '12 at 21:11) Tosif Ahamed

Tosif, we have indeed looked at this specific problem, as we were working on fMRI data too. I have added a reference to our published paper in my answer.

(Aug 03 '12 at 04:16) Gael Varoquaux

It seems the trace Lasso solves exactly this problem: http://arxiv.org/abs/1109.1990 .

answered Mar 05 '12 at 12:23

Nicolas Le Roux


The trace lasso is a very beautiful piece of work that would solve this problem if the population covariance matrix were known. Unfortunately, all we have is the observed X, and in my experience, in the small-sample limit it fails to define a good penalization. A simple way to illustrate the limitation of the trace lasso is that given n data points, it can only define n directions in which the trace norm is strongly convex.

(Mar 05 '12 at 14:54) Gael Varoquaux