I implemented a multi-class classifier using Logistic Regression in Scikit-learn. I had an initial set of features, which gave me an initial accuracy number that I measured using cross-validation. Then I added another feature, but the accuracy did not improve at all. I checked the chi-squared statistic between this feature and the outcome (the class label for each document), and it turns out that the chi-squared value is in fact high. The p-value turned out to be 0.00003, which is very small. Taking the null hypothesis to be that the feature is not related to the outcome variable, we can easily reject it here. Even though the low p-value indicates a strong association between the feature and the outcome, why is there no significant improvement in the accuracy of the classifier? Is there something strange going on here? I don't have any particular code to share.
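The setup described above can be reproduced as a minimal sketch with synthetic data (the feature matrix, number of classes, and model settings here are assumptions, not the asker's actual data): score each feature with `chi2` against the labels, then measure cross-validated accuracy with `LogisticRegression`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the document/feature matrix described above.
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           n_classes=3, n_clusters_per_class=1, random_state=0)
X = X - X.min()  # chi2 requires non-negative feature values

# Per-feature chi-squared scores and p-values against the class labels.
scores, pvalues = chi2(X, y)

# Baseline cross-validated accuracy with all features (5-fold CV).
acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
```

A very small p-value for one feature does not guarantee that adding it will move `acc`, which is the puzzle raised in the question.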
|
Did you check for correlation? If two variables are highly correlated, they can both have small p-values with respect to the response, yet the second one may not improve classification at all. This is not true for every classifier, but as you increase the number of variables, adding correlated ones rarely makes much of a difference.

Thanks Leon. Let me do this and get back with an answer. This matches what I observed in my experiment, and your answer makes a lot of sense now.
(Feb 02 at 12:02)
Abhishek Shivkumar
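The correlation explanation in the answer above can be demonstrated with a small synthetic sketch (the variable names and data-generating process here are illustrative assumptions): a new feature that is nearly a copy of an existing one has a tiny p-value against the labels on its own, yet adding it barely changes cross-validated accuracy.

```python
import numpy as np
from sklearn.feature_selection import chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 600
base = rng.poisson(3.0, size=n).astype(float)        # existing informative feature
y = (base + rng.normal(0, 1, n) > 3).astype(int)     # labels driven by `base`
dup = base + rng.normal(0, 0.05, n)                  # "new" feature, almost a copy
dup = dup - dup.min()                                # chi2 needs non-negative values

# On its own, the duplicate is strongly associated with y (tiny p-value)...
_, p = chi2(dup.reshape(-1, 1), y)

# ...yet adding it barely changes cross-validated accuracy,
# because it carries almost no information beyond `base`.
clf = LogisticRegression(max_iter=1000)
acc_one = cross_val_score(clf, base.reshape(-1, 1), y, cv=5).mean()
acc_two = cross_val_score(clf, np.column_stack([base, dup]), y, cv=5).mean()

corr = np.corrcoef(base, dup)[0, 1]  # near 1: the two features are redundant
```

This is why the chi-squared test and the accuracy measurement can disagree: the test evaluates each feature against the labels in isolation, while the classifier only benefits from information not already present in the other features.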