Could someone please give me some advice as to how I can go about identifying irrelevant features from a collection of features in a training set. I am interested to know what the most effective techniques are would appreciate some practical guides or resources where I could learn more about this from a practical perspective (if Matlab specific that would also be great). Also is it reasonable to expect substantial improvements (to training and cross validation cost) if irrelevant features are identified and removed (the ratio of training examples to features is about 130/1)? |
There are two different approaches for this built into Matlab: stepwise (sequential) feature selection, and regularized regression such as the lasso.
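For concreteness, here is a rough sketch of both, assuming the built-in functions in question are sequentialfs and lassoglm from the Statistics Toolbox (the answer itself does not name them, and X and y are placeholder names for the feature matrix and 0/1 labels):

    % 1) Stepwise (sequential) feature selection with 10-fold cross-validation.
    %    The criterion counts misclassifications on the held-out fold, so
    %    sequentialfs keeps only features that reduce that count.
    critfun = @(Xtr, ytr, Xte, yte) ...
        sum(yte ~= round(glmval(glmfit(Xtr, ytr, 'binomial'), Xte, 'logit')));
    [selected, history] = sequentialfs(critfun, X, y, ...
        'Options', statset('Display', 'iter'));

    % 2) Regularized (lasso) logistic regression: features whose coefficients
    %    are driven exactly to zero are ones the model can do without.
    [B, FitInfo] = lassoglm(X, y, 'binomial', 'CV', 10);
    keep = B(:, FitInfo.Index1SE) ~= 0;   % features retained at the 1-SE lambda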
You might as well try both since they are built-in functions. If you have a very large number of features, the stepwise procedure might be too inefficient (and suboptimal, given the greedy nature of the algorithm); otherwise either one could work.

Hi Dan, thank you for taking the time to respond, I really appreciate it :) I am actually trying the first method and evaluating the F1 score (I am doing logistic regression) on the train and test sets for every feature I remove (i.e. with 1000 features: remove one feature --> train the logistic regression model --> calculate the F1 scores for the train and test sets --> replace the feature and remove the next, and so on). Then I am going to compare the F1 scores at the end. My thinking is that when a feature is removed and the F1 score goes up (slightly), that is an indication that it is a slightly irrelevant feature; is this a correct assumption? Regarding the regularisation: I find that with the full feature set I have a lot of bias, so won't the introduction of the regularisation parameter result in further bias?
(Feb 07 '14 at 08:42)
Farzan Maghami
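A minimal sketch of the leave-one-feature-out loop described in the comment above, assuming a binary 0/1 target and pre-split Xtrain/ytrain/Xtest/ytest matrices (all names, including the helper f1score, are placeholders rather than anything from the thread):

    function [f1Train, f1Test] = looFeatureF1(Xtrain, ytrain, Xtest, ytest)
    % For each feature j: drop it, refit the logistic regression, and record
    % the F1 score on the train and test sets. A feature whose removal raises
    % the test F1 is a candidate for being irrelevant.
    p = size(Xtrain, 2);
    f1Train = zeros(p, 1);
    f1Test  = zeros(p, 1);
    for j = 1:p
        cols = setdiff(1:p, j);                           % all features except j
        b = glmfit(Xtrain(:, cols), ytrain, 'binomial');  % logistic regression fit
        f1Train(j) = f1score(ytrain, round(glmval(b, Xtrain(:, cols), 'logit')));
        f1Test(j)  = f1score(ytest,  round(glmval(b, Xtest(:, cols),  'logit')));
    end
    end

    function f1 = f1score(ytrue, ypred)
    % F1 is the harmonic mean of precision and recall for the positive (1) class.
    tp        = sum(ypred == 1 & ytrue == 1);
    precision = tp / max(sum(ypred == 1), 1);
    recall    = tp / max(sum(ytrue == 1), 1);
    f1        = 2 * precision * recall / max(precision + recall, eps);
    end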
As for removing features using the F1 score as the criterion, I think you are on the right track. For the regularization question: yes, regularization usually decreases variance and increases bias, but a small amount could still help if you have a lot of irrelevant features. A small regularization coefficient shouldn't change the overall bias much; it should mainly zero out features with a very weak signal. Also, I think removing features is itself effectively a form of regularization...
(Feb 08 '14 at 17:33)
Dan Ryan
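A short sketch of that last point, assuming an L1-penalised logistic fit via lassoglm (the lambda grid and variable names are illustrative only):

    % Fit an L1-penalised logistic regression over a grid of small penalties.
    lambdas = logspace(-4, -1, 20);
    [B, FitInfo] = lassoglm(X, y, 'binomial', 'Lambda', lambdas);

    % At each lambda, count coefficients driven exactly to zero: a weak penalty
    % mostly removes features with very little signal while leaving the rest of
    % the fit (and hence its bias) almost unchanged.
    numZeroed = sum(B == 0, 1);
    lassoPlot(B, FitInfo, 'PlotType', 'Lambda', 'XScale', 'log');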