
What are the best methods for feature selection in a high-dimensional feature space (with #features > 10K), where there are considerably more features than examples? And are there particular settings in which one method is preferable to another?

asked Feb 27 '12 at 05:19


Vam

It's very hard to tell anything useful without more details.

What kind of data is this? Text features? E-commerce transaction logs? Gene expression microarray data? Is the data dense or sparse? How many samples do you have?

What are you trying to achieve? Classification, regression, unsupervised analysis such as topic modelling or clustering?

(Feb 27 '12 at 05:47) ogrisel

2 Answers:

The other option is to select features, rather than combinations of features. For this there are a few main approaches:

  • Do univariate tests on each feature, keep the N best,

  • Fit a sparse multivariate model, and keep only the features with non-zero weights,

  • Use a multivariate model that can keep an importance score, such as a random forest.

All these methods are documented, with examples, in the feature-selection chapter of the scikit-learn documentation, and you can find the code in scikit-learn itself. If you need to translate the code to another language, the univariate tests will be fairly easy to translate, but the rest will require more work.
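To make the three approaches concrete, here is a minimal sketch using the current scikit-learn API (which may differ from what was available when this answer was written). The synthetic data, the value of `N_KEEP`, the choice of an L1-penalised `LinearSVC` as the sparse model, and the value of `C` are all assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import LinearSVC

# Synthetic stand-in for a "many more features than samples" problem.
X, y = make_classification(
    n_samples=100, n_features=2000, n_informative=10,
    n_redundant=0, shuffle=False, random_state=0,
)

N_KEEP = 50  # illustrative choice, not a recommendation

# 1) Univariate tests: score each feature on its own, keep the N best.
univariate = SelectKBest(score_func=f_classif, k=N_KEEP).fit(X, y)
univariate_idx = univariate.get_support(indices=True)

# 2) Sparse multivariate model: the L1 penalty drives most weights to
#    exactly zero; keep the features whose weights are non-zero.
sparse_model = LinearSVC(penalty="l1", dual=False, C=0.1, max_iter=5000)
sparse_model.fit(X, y)
sparse_idx = np.flatnonzero(sparse_model.coef_.ravel() != 0)

# 3) Multivariate model with an importance score: rank features by the
#    random forest's importances and keep the N highest.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
forest_idx = np.argsort(forest.feature_importances_)[::-1][:N_KEEP]

print(len(univariate_idx), len(sparse_idx), len(forest_idx))
```

Note that only the first and third approaches let you pick the number of kept features directly; with the sparse model that number is controlled indirectly through the regularisation strength (`C` here).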

Also, if you really care about speed and scaling to very large datasets, you probably want to stick to the univariate filtering approach.
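As an illustration of why univariate filtering scales well and is easy to port, here is a sketch in plain NumPy. It uses a Welch t-statistic for a two-class problem, which is one possible univariate test among many and not necessarily the one scikit-learn uses; the helper name `top_n_by_t_statistic` and the toy data are hypothetical.

```python
import numpy as np

def top_n_by_t_statistic(X, y, n_keep):
    """Return indices of the n_keep features with the largest |t| between classes 0 and 1."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    a, b = X[y == 0], X[y == 1]
    mean_diff = a.mean(axis=0) - b.mean(axis=0)
    # Welch standard error; the small epsilon guards against constant features.
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b)) + 1e-12
    t = np.abs(mean_diff / se)
    # Sort by descending score and keep the first n_keep indices.
    return np.argsort(t)[::-1][:n_keep]

# Toy data: 60 samples, 10,000 features, only the first five informative.
rng = np.random.default_rng(0)
n_samples, n_features = 60, 10_000
y = np.repeat([0, 1], n_samples // 2)
X = rng.normal(size=(n_samples, n_features))
X[y == 1, :5] += 3.0

selected = top_n_by_t_statistic(X, y, n_keep=20)
print(sorted(selected.tolist()))
```

Each feature is scored independently in a single pass over the data, so the cost grows linearly with the number of features and the computation is trivial to split across chunks of columns.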

answered Feb 27 '12 at 07:35


Gael Varoquaux

edited Feb 27 '12 at 16:09

There are also dimensionality reduction techniques such as principal component analysis. You can use SVD (Singular Value Decomposition), which gives the principal directions of variation in the data. The first few components (perhaps the first 3 singular values in the diagonal matrix D) may account for almost 95% of the variance in your data. In that case, evaluating on those components should give results close to evaluating on the full data.
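A small sketch of how one might check this with NumPy: compute the SVD of the centred data, turn the singular values into shares of variance, and find how many components are needed to reach 95%. The toy data (three latent directions plus noise) is an assumption made for illustration; on real data the number of components needed can be much larger. Note that this yields combinations of features, not a subset of the original features.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, rank = 50, 2000, 3

# Toy data: three latent directions plus a little noise.
latent = rng.normal(size=(n_samples, rank))
loadings = rng.normal(size=(rank, n_features))
X = latent @ loadings + 0.1 * rng.normal(size=(n_samples, n_features))

# Centre the columns so that the SVD corresponds to PCA.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Squared singular values are proportional to the variance per component.
explained = s**2 / np.sum(s**2)
cumulative = np.cumsum(explained)

# Smallest number of components whose cumulative share reaches 95%.
k = int(np.searchsorted(cumulative, 0.95) + 1)

# Project the samples onto the first k principal directions.
X_reduced = Xc @ Vt[:k].T
print(k, X_reduced.shape)
```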

answered Feb 27 '12 at 07:13


Kuri_kuri

edited Feb 27 '12 at 07:19


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.