What would be the best feature selection method to use in conjunction with a random forest when there are 65,000 features? Also, do you think RF would choke on a 65,000 x 10,000 matrix?
This is an old question, but perhaps the following will be helpful to someone searching in the future. Are you asking about feature selection to (a) make learning faster, (b) improve prediction accuracy, (c) make studying the resulting model easier, or (d) reduce data collection costs?

If (a) and your 65K features are sparse, look at FEST, an implementation of decision tree ensembles for sparse data (it includes random forests). This may well be fast enough that you can skip feature selection entirely.

If (b), you should reconsider. Bagging has been shown to improve robustness to weak and irrelevant features, and I see no reason why the same should not be true for random forests. For details, see Ali & Pazzani (1996), "Error reduction through learning multiple descriptions," Machine Learning, 24, and Munson & Caruana (2009), "On feature selection, bias-variance, and bagging," ECML PKDD.

If (c) or (d), I recommend looking at Tuv, Borisov, Runger, & Torkkola (2009), "Feature selection with ensembles, artificial variables, and redundancy elimination," Journal of Machine Learning Research, 10. As long as you can afford to run random forests, this is a great way to do feature selection (especially if the final model will be a random forest).

Finally, if you really need to do feature selection as a preprocessing step (e.g., your data is dense and running random forests is not feasible), you will want to look into filter methods; see Guyon & Elisseeff (2003), "An introduction to variable and feature selection," Journal of Machine Learning Research, 3. Start with simple methods and move to more sophisticated tools only if necessary.
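For concreteness, here is a minimal sketch of both routes just mentioned (a simple filter and importance-based selection with a forest). It is my own illustration rather than part of the cited papers; it assumes scikit-learn, and the synthetic data, the score function, and the choice of k are all placeholder assumptions.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectKBest, mutual_info_classif

    # Synthetic stand-in for the real data (the actual matrix would be ~10,000 x 65,000).
    X, y = make_classification(n_samples=2000, n_features=5000,
                               n_informative=50, random_state=0)
    k = 500  # number of features to keep; purely an illustrative choice

    # Route 1: a simple filter method (Guyon & Elisseeff-style preprocessing).
    # Each feature is scored independently of the learner, so this scales to very wide data.
    X_filtered = SelectKBest(score_func=mutual_info_classif, k=k).fit_transform(X, y)

    # Route 2: selection via random forest importances
    # (a much-simplified stand-in for the Tuv et al. procedure).
    rf = RandomForestClassifier(n_estimators=500, max_features="sqrt", n_jobs=-1)
    rf.fit(X, y)
    top_k = np.argsort(rf.feature_importances_)[::-1][:k]  # indices of the k most important features
    X_selected = X[:, top_k]

For genuinely sparse 65K-dimensional data you would presumably swap the dense synthetic matrix for a scipy.sparse one; the filter route in particular copes well with that.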
Keeping the number of features down matters more than keeping the number of instances down, that is true. I believe most normal RF implementations will choke on a dataset of this size on normal hardware (although you will have to test to be sure). If it doesn't work and you know how to distribute the computation, go ahead and try that. If you don't know how to do that (and you can't find an algorithm that already does it), you may need to explicitly partition your features and create... an ensemble of ensembles. With 65,000 features you get 13 lots of 5,000 features, which is probably more doable, so try running RF on each of those and then ensemble the results (a rough sketch follows after this comment). One more thing, from Wikipedia: random forests do not handle large numbers of irrelevant features as well as ensembles of entropy-reducing decision trees. Try compressing features by finding features with high covariance. PCA is probably not feasible on a dataset of this size, but you may be able to find a large number of redundant features to remove.
(Nov 04 '10 at 18:42)
Robert Layton
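Here is a rough sketch of the "ensemble of ensembles" idea above (my own illustration, not part of the original comment), assuming scikit-learn and a dense numpy matrix; the number of blocks and the forest settings are arbitrary choices.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def ensemble_of_ensembles(X_train, y_train, X_test, n_blocks=13):
        """Fit one random forest per block of features and average the
        predicted class probabilities over all blocks."""
        blocks = np.array_split(np.arange(X_train.shape[1]), n_blocks)
        probas = []
        for cols in blocks:
            rf = RandomForestClassifier(n_estimators=200, n_jobs=-1)
            rf.fit(X_train[:, cols], y_train)
            probas.append(rf.predict_proba(X_test[:, cols]))
        # Average the per-block probability estimates: shape (n_test, n_classes).
        return np.mean(probas, axis=0)

Averaging probabilities keeps the combination simple; stacking a second-level model on top of the block predictions would be a natural extension.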
One needs to pay attention to how random forest is defined. The Wikipedia article defines them in the same way as Breiman, with the search for the best feature to split on restricted to a small random subset of all the features. Some papers, however, define random forests as a collection of fully random trees; in other words, the feature to split on is chosen randomly. The claim in the Wikipedia article that RF do not handle large numbers of irrelevant features is supported by a citation to a paper that appears to use fully random trees in the ensemble. I suspect this is part of the reason that flavor of RF does not handle irrelevant features as well as bagged trees. (A small illustration of the difference follows below this comment.)
(Feb 01 '11 at 13:25)
Art Munson
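To make that distinction concrete, here is a small sketch (my own addition, assuming scikit-learn; the parameter values are illustrative). The size of the random candidate subset searched at each split is controlled by max_features, so a Breiman-style forest and something close to fully random feature choice differ only in that setting.

    from sklearn.ensemble import RandomForestClassifier

    # Breiman-style RF: at each split, pick the best of sqrt(n_features)
    # randomly drawn candidate features.
    breiman_rf = RandomForestClassifier(n_estimators=500, max_features="sqrt")

    # Roughly approximates fully random trees: only one randomly drawn candidate
    # feature is considered per split, so the split feature is effectively chosen
    # at random (the threshold on that feature is still optimized).
    nearly_random_rf = RandomForestClassifier(n_estimators=500, max_features=1)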