I am working with DNA sequences and have obtained over 90% AUC, yet I am not sure how to infer which features are predictive of which class.

asked Sep 23 '10 at 15:20


Sandra Smieszek

edited Sep 24 '10 at 15:35


Joseph Turian ♦♦

Do you really mean reconstruct? Or do you just mean that you want to figure out which features contribute to actual outputs?

(Sep 23 '10 at 16:55) Joseph Turian ♦♦

The latter: I simply want to find out which combinations of features are representative of which class. What I currently have is that, using the top 30 out of 1500 features, I am able to get 0.97 AUC. That gives me the number 30, yet I have no clue which features are representative of which class.

(Sep 24 '10 at 05:34) Sandra Smieszek

I did do feature selection using VIMP

(Sep 24 '10 at 05:35) Sandra Smieszek

3 Answers:

There are some general "black box" strategies for this. Please see Strumbelj and Kononenko, "Explaining Classifications for Individual Instances" (2008); Strumbelj et al., "Explaining Instance Classifications with Interactions of the Feature Values" (2009); and Strumbelj and Kononenko, "An Efficient Explanation of Individual Classifications Using Game Theory" (2010). There is also Baehrens et al., "How to Explain Individual Classification Decisions" (2010).
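Those papers attribute a classifier's output for a single instance to its features via (approximations of) Shapley values. As a rough illustration only, not the papers' exact algorithm, here is a minimal Monte Carlo sketch assuming a scikit-learn-style classifier with `predict_proba`; the helper name and the toy data are made up:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def sampled_shapley(model, X_background, x, n_samples=100, seed=0):
    """Monte Carlo estimate of each feature's contribution to
    model.predict_proba(x)[0, 1], in the spirit of the sampling-based
    Shapley-value explanations of Strumbelj and Kononenko (2010)."""
    rng = np.random.default_rng(seed)
    n_features = x.shape[0]
    phi = np.zeros(n_features)
    for _ in range(n_samples):
        order = rng.permutation(n_features)  # random feature ordering
        # Start from a random background row, i.e. all features "absent".
        current = X_background[rng.integers(len(X_background))].copy()
        for j in order:
            before = model.predict_proba(current[None, :])[0, 1]
            current[j] = x[j]                # switch feature j to its real value
            after = model.predict_proba(current[None, :])[0, 1]
            phi[j] += after - before         # marginal contribution of feature j
    return phi / n_samples

# Toy usage with placeholder data (real DNA-sequence features would go here).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = (X[:, 0] > 0).astype(int)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(sampled_shapley(forest, X, X[0]))
```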

answered Sep 25 '10 at 13:25


downer

There is no general-purpose way of doing that. Random forests are very nonlinear and hence use all features, and some feature interactions, to solve a problem. You can try model compression (Bucila et al.) to get a simpler model that predicts faster and can be more easily interpreted.

If you want a clear statement of which features are most relevant, it might be better to use a different, simpler model (such as the lasso) to do feature selection, and then train random forests on just those features to check that performance does not decrease much.
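A rough sketch of that two-stage approach (with a modern toolkit like scikit-learn; the data here are placeholders and the regularization strength C would need tuning on real data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholder data: 200 samples, 1500 features, labels driven by the first five features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1500))
y = (X[:, :5].sum(axis=1) > 0).astype(int)

# Stage 1: L1-penalized logistic regression (the classification analogue of the lasso)
# zeroes out the coefficients of uninformative features; C controls the sparsity.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)
selected = np.flatnonzero(lasso.coef_.ravel())
print("selected features:", selected)

# Stage 2: retrain a random forest on just the selected features and compare AUC.
rf = RandomForestClassifier(n_estimators=500, random_state=0)
auc_all = cross_val_score(rf, X, y, scoring="roc_auc", cv=5).mean()
auc_sel = cross_val_score(rf, X[:, selected], y, scoring="roc_auc", cv=5).mean()
print(f"AUC with all features: {auc_all:.3f}, with selected only: {auc_sel:.3f}")
```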

answered Sep 23 '10 at 16:34


Alexandre Passos ♦

Alex's proposed strategies of model compression or using simpler models are good advice.

If you prefer to directly analyze the random forest model, though, here is what you can do. Step 1: derive estimates of how strongly each input influences the model's output (you may already have these from the VIMP feature selection). Step 2: generate partial dependence plots to visualize the approximate relationship between the important inputs and the output. I've done this and it is not too hard; you can find the write-up in "Mining citizen science data to predict prevalence of wild bird species" for more details.
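A minimal sketch of those two steps, assuming a recent scikit-learn (permutation importance standing in for VIMP-style scores, and toy data in place of the real features):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance, PartialDependenceDisplay

# Placeholder data; the real feature matrix and class labels would go here.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 30))
y = (X[:, 0] + X[:, 1] * X[:, 2] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

# Step 1: estimate how strongly each input influences the model's output.
imp = permutation_importance(forest, X, y, n_repeats=10, random_state=0)
top = np.argsort(imp.importances_mean)[::-1][:5]
print("most influential features:", top)

# Step 2: partial dependence of the predicted class probability on the top inputs.
PartialDependenceDisplay.from_estimator(forest, X, features=list(top[:3]))
plt.show()
```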

If you suspect statistical interactions between inputs are important (that is, the output is a non-additive function of the inputs, e.g. y = x1*x2), you might want to look at Detecting Statistical Interactions with Additive Groves of Trees to check for important interactions. Then you can do partial dependence plots that show the joint effect of a pair of inputs on the output (see the sketch below). Code implementing this algorithm is here. Expect Additive Groves to be more computationally expensive than random forests, with some parameters to tune.
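Additive Groves has its own implementation (linked above); for just the joint partial dependence plot of a suspected pair of interacting inputs, here is a minimal scikit-learn sketch (the feature indices and data are invented for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import PartialDependenceDisplay

# Placeholder data with a built-in interaction between features 1 and 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = (X[:, 1] * X[:, 2] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

# Passing a pair of feature indices yields a two-way partial dependence plot,
# i.e. the joint effect of that pair on the predicted class probability.
PartialDependenceDisplay.from_estimator(forest, X, features=[(1, 2)])
plt.show()
```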

answered Sep 24 '10 at 15:26


Art Munson
