I implemented a multi-class classifier using Logistic Regression in Scikit-learn.

I had an initial set of features, which gave me an initial accuracy number that I measured using cross-validation.
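
The measurement was along these lines; the data below is a random placeholder rather than my actual features:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Placeholder data standing in for the real feature matrix and labels:
    # X has shape (n_documents, n_features), y holds one class label per document.
    rng = np.random.RandomState(0)
    X = rng.rand(200, 10)
    y = rng.randint(0, 3, size=200)

    clf = LogisticRegression(max_iter=1000)
    scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    print("mean cross-validated accuracy: %.3f" % scores.mean())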

Then I added another feature, but the accuracy did not improve at all.

I checked the chi-squared statistic between this feature and the outcome (the class label for each document), and it turns out that the chi-squared value is in fact high. The p-value came out to about 0.00003, which is very small. Taking the null hypothesis to be that the feature is not related to the outcome variable, we can easily reject it.
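
The check itself was essentially Scikit-learn's chi2 applied to the single new feature column; the data here is again a placeholder:

    import numpy as np
    from sklearn.feature_selection import chi2

    # Placeholder stand-ins: new_feature is the added feature per document
    # (chi2 requires non-negative values), y is the class label per document.
    rng = np.random.RandomState(0)
    new_feature = rng.rand(200)
    y = rng.randint(0, 3, size=200)

    chi2_stat, p_value = chi2(new_feature.reshape(-1, 1), y)
    print("chi2 = %.3f, p = %.5f" % (chi2_stat[0], p_value[0]))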

Given that the low p-value indicates a strong association between the feature and the outcome, why is there no significant improvement in the accuracy of the classifier?

Is there something strange going on here? I don't have any particular code to share.

asked Feb 02 at 09:38

Abhishek Shivkumar

One Answer:

Did you check for correlation? If two variables are strongly correlated, they can both have small p-values with respect to the response, yet adding the second one might not improve the classification at all. This is not true for every classifier, but as you increase the number of variables, adding correlated variables does not really make much of a difference.
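
A quick way to check, assuming your features fit in a NumPy array (all names and data here are hypothetical placeholders):

    import numpy as np

    # Hypothetical example: X stands in for the original feature matrix and
    # new_feature for the added column, constructed to be nearly a copy of X[:, 0].
    rng = np.random.RandomState(0)
    X = rng.rand(200, 10)
    new_feature = 0.9 * X[:, 0] + 0.1 * rng.rand(200)

    # Pearson correlation of the new feature against every existing feature.
    corrs = [np.corrcoef(X[:, j], new_feature)[0, 1] for j in range(X.shape[1])]
    print("max |correlation| with existing features: %.2f" % max(abs(c) for c in corrs))

If the new feature is nearly a linear copy of an existing one, it adds almost no information the classifier can use, however significant its individual chi-squared test is.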

Check this text

answered Feb 02 at 10:39

Leon Palafox ♦

edited Feb 02 at 10:40

Thanks Leon. Let me do this and get back with an answer. What you describe matches what I saw in my experiment, and your answer makes a lot of sense now.

(Feb 02 at 12:02) Abhishek Shivkumar