I have a dataset of about 100 samples, each with >10,000 features, some of which are highly correlated. Here is what I am currently doing:

1.) Split the data set into three folds.
2.) For each fold, ...
3.) Select the features corresponding to the non-zero weights returned in step 2.2.
4.) Use these features for training and testing an SVM.

My problem is that in step 3 I get a different set of features for each fold. How do I get one final model out of this, and one final list of relevant features? Can I take the intersection of the features selected in step 3 across all folds? Features that are selected in all three folds would appear to be the most stable/significant. Can I do this, or is it cheating?
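For concreteness, here is a minimal sketch of the workflow above on synthetic data, under the assumption that step 2.2 fits an L1-penalised model (LassoCV here); the estimators, data sizes, and parameters are illustrative stand-ins, not my actual setup:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold
from sklearn.linear_model import LassoCV
from sklearn.svm import SVR

# Synthetic stand-in for the real data (feature count reduced to keep this quick)
X, y = make_regression(n_samples=100, n_features=1000, n_informative=10,
                       noise=1.0, random_state=0)

selected_per_fold = []
for train_idx, test_idx in KFold(n_splits=3, shuffle=True, random_state=0).split(X):
    # Step 2 (assumed): fit an L1-penalised model on the training part of the fold
    lasso = LassoCV(cv=5, max_iter=5000).fit(X[train_idx], y[train_idx])
    # Step 3: keep the features with non-zero weights
    selected = np.flatnonzero(lasso.coef_)
    selected_per_fold.append(set(selected))
    # Step 4: train and test an SVM restricted to those features
    svm = SVR().fit(X[np.ix_(train_idx, selected)], y[train_idx])
    print("fold R^2:", svm.score(X[np.ix_(test_idx, selected)], y[test_idx]))

# The proposed rule: keep only the features selected in every fold
stable_features = set.intersection(*selected_per_fold)
print("features selected in all folds:", sorted(stable_features))
```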
FYI, a similar scheme (stability selection using the Randomized LASSO) is implemented directly in scikit-learn, with the feature-selection part exposed as a pipelineable transformer (RandomizedLasso).

Now, to answer your original question: you could try the intersection and see whether it works, but if you want some theoretical backing you should read Meinshausen and Bühlmann, who go further and use randomized feature scaling and, optionally, bootstraps of the samples as in BoLASSO. A more rigorous example of RandomizedLasso is also available in the scikit-learn documentation (even though its plots currently seem to be broken).
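To make the idea concrete without relying on the built-in transformer, here is a hand-rolled sketch of that kind of stability selection: repeatedly subsample the rows, randomly rescale the columns (the randomized-LASSO perturbation), fit a Lasso, and count how often each feature gets a non-zero weight. The alpha, scaling range, number of resamples, and the 0.75 threshold are arbitrary illustrative choices, not recommendations:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic stand-in for the real data
X, y = make_regression(n_samples=100, n_features=1000, n_informative=10,
                       noise=1.0, random_state=0)

rng = np.random.default_rng(0)
n_resampling, sample_fraction, scaling, alpha = 200, 0.75, 0.5, 1.0
counts = np.zeros(X.shape[1])

for _ in range(n_resampling):
    # Subsample the rows (here without replacement; BoLASSO would bootstrap instead)
    rows = rng.choice(X.shape[0], size=int(sample_fraction * X.shape[0]),
                      replace=False)
    # Randomly down-weight each column to break ties among correlated features
    col_scale = rng.uniform(scaling, 1.0, size=X.shape[1])
    lasso = Lasso(alpha=alpha, max_iter=5000).fit(X[rows] * col_scale, y[rows])
    counts += lasso.coef_ != 0

# Features whose selection frequency exceeds the (arbitrary) threshold
selection_frequency = counts / n_resampling
stable = np.flatnonzero(selection_frequency >= 0.75)
print("stable features:", stable)
```

Counting selection frequencies like this is a softer version of the intersection you propose: instead of requiring a feature to survive all three folds, you rank features by how often they survive many perturbed fits.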