I have a dataset of about 100 samples, each with >10,000 features, some of which are highly correlated. Here's what I am doing currently:

1.) Split the data set into three folds.

2.) For each fold,

  2.1) Run elastic net for 100 values of lambda. (This returns an nfeatures x 100 matrix of weights.)
  2.2) Take the union of the features with non-zero weights across all 100 lambda values. (This returns an nfeatures x 1 indicator vector.)

3.) Select the features corresponding to the non-zero entries returned in 2.2.

4.) Use these features to train and test an SVM.

My problem is that in step 3 I get a different set of features for each fold. How do I get one final model out of this, i.e. one final list of relevant features? Can I take the intersection of the features selected in step 3 across all folds? Features that are selected in all three folds would appear to be the most stable/significant. Can I do this, or is it cheating? (A code sketch of the workflow, including both combining options, follows below.)
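To make the question concrete, here is a minimal sketch of steps 1-3 written against the current scikit-learn API (enet_path and KFold); X and y are assumed to be the 100 x 10,000 dataset already loaded, and l1_ratio=0.5 is an arbitrary choice, not a prescribed one:

import numpy as np
from sklearn.linear_model import enet_path
from sklearn.model_selection import KFold

def per_fold_supports(X, y, n_folds=3, n_alphas=100):
    """Steps 1-2: for each fold, fit an elastic-net path over n_alphas
    lambda values and take the union of features that ever get a
    non-zero weight along the path."""
    supports = []
    for train_idx, _ in KFold(n_splits=n_folds).split(X):
        # coefs has shape (n_features, n_alphas): one column per lambda.
        _, coefs, _ = enet_path(X[train_idx], y[train_idx],
                                l1_ratio=0.5, n_alphas=n_alphas)
        supports.append(np.any(coefs != 0, axis=1))  # union over the path
    return supports

# Step 3 variants: combine the per-fold supports by union or by
# intersection; the intersection keeps only features selected in
# every fold.
# supports = per_fold_supports(X, y)
# union_mask = np.logical_or.reduce(supports)
# intersection_mask = np.logical_and.reduce(supports)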

asked Mar 02 '12 at 10:52 by Tosif Ahamed

One Answer:

FYI, a similar scheme (stability selection using the randomized LASSO) is implemented directly in scikit-learn, with a pipelineable transformer for the feature-selection step:

>>> from sklearn.linear_model import RandomizedLasso
>>> from sklearn.svm import SVC
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.datasets import load_digits
>>> from sklearn.cross_validation import train_test_split

>>> digits = load_digits()
>>> X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target)

>>> p = Pipeline([
...     ('selector', RandomizedLasso()),
...     ('clf', SVC(kernel='rbf', C=1000, gamma=0.001))])
...
>>> p.fit(X_train, y_train).score(X_test, y_test)
Warning: invalid value encountered in divide
Warning: invalid value encountered in divide
Warning: invalid value encountered in divide
Warning: invalid value encountered in divide
0.99111111111111116

>>> p.steps[0][1].get_support()
array([False,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True, False,  True,  True,
        True,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,  True, False,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True, False,  True,  True,  True,  True,  True,  True,  True], dtype=bool)

Now, to answer your original question: you could try the intersection and see if it works, but if you want some theoretical grounding you should read Meinshausen and Bühlmann on stability selection, who additionally use randomized feature scaling and, optionally, bootstraps of the samples as in BoLASSO; a hand-rolled sketch of that scheme is given below.
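For illustration, here is a minimal hand-rolled sketch of that scheme using the plain Lasso from scikit-learn; the alpha, scaling range, number of resamples, and 0.75 threshold are all illustrative choices, not prescribed values:

import numpy as np
from sklearn.linear_model import Lasso

def stability_selection(X, y, alpha=0.01, scaling=0.5,
                        n_resamples=200, threshold=0.75, seed=0):
    """Rough sketch of Meinshausen & Buhlmann-style stability selection:
    fit a lasso on bootstrap resamples with randomly down-weighted
    features, then keep features selected in >= threshold of the runs."""
    rng = np.random.RandomState(seed)
    n_samples, n_features = X.shape
    counts = np.zeros(n_features)
    for _ in range(n_resamples):
        idx = rng.randint(0, n_samples, n_samples)  # bootstrap, as in BoLASSO
        # Randomized-lasso trick: rescaling feature j by a random factor
        # in [scaling, 1] perturbs its effective penalty on each run.
        w = scaling + (1.0 - scaling) * rng.rand(n_features)
        coef = Lasso(alpha=alpha).fit(X[idx] * w, y[idx]).coef_
        counts += (coef != 0)
    return counts / n_resamples >= threshold  # boolean support mask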

Also, a more rigorous example of RandomizedLasso is available in the documentation (even though the plots currently seem to be broken).

answered Mar 02 '12 at 12:19 by ogrisel
