Could someone please give me some advice on how to identify irrelevant features among a collection of features in a training set?

I am interested to know what the most effective techniques are, and I would appreciate any practical guides or resources where I could learn more about this from a hands-on perspective (Matlab-specific material would be especially welcome).

Also, is it reasonable to expect substantial improvements (in training and cross-validation cost) if irrelevant features are identified and removed? The ratio of training examples to features is about 130:1.

asked Feb 04 '14 at 22:56


Farzan Maghami


One Answer:

There are two different approaches for this built into Matlab:

  1. Use a stepwise procedure that adds/removes variables one at a time, keeping changes that improve the model. Check out the documentation for stepwiseglm in the Statistics Toolbox. Note that this is a greedy procedure, so it is not equivalent to exhaustively searching over all variable combinations.

  2. Add a regularization (penalty) term to the objective function. The most common choices are an L1 (absolute value) or L2 (sum of squares) penalty on the weight vector. In the GLM world, L1 regularization is called the "lasso" penalty, L2 is called "ridge regression", and a combination of the two is called the "elastic net". The lasso penalty often drives some coefficients exactly to zero, excluding those variables from the model, whereas the L2 penalty leaves small but nonzero weights on irrelevant predictors while exerting stronger control over large coefficients. The Matlab function lassoglm implements both of these penalties.

You might as well try both, since they are built-in functions. If you have a very large number of features the stepwise procedure might be too slow (and suboptimal, given the greedy nature of the algorithm); otherwise either one could work.
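To make this concrete, here is a minimal sketch of both approaches. It assumes a numeric n-by-p feature matrix X and a binary response vector y; those names, and the choice of a binomial model, are just placeholders for illustration:

    % Illustrative only: X is an n-by-p feature matrix, y an n-by-1 binary response.

    % 1. Greedy stepwise selection (Statistics Toolbox).
    %    Start from an intercept-only model and let stepwiseglm add/remove
    %    predictors one at a time, up to a full linear model.
    mdlStep = stepwiseglm(X, y, 'constant', ...
        'Distribution', 'binomial', 'Upper', 'linear');
    disp(mdlStep.Formula)              % shows which predictors survived

    % 2. L1/L2 regularization with lassoglm.
    %    Alpha = 1 is the lasso (pure L1); smaller Alpha values move toward
    %    ridge-like behaviour (elastic net in between). 10-fold CV picks lambda.
    [B, FitInfo] = lassoglm(X, y, 'binomial', 'Alpha', 1, 'CV', 10);
    idx = FitInfo.Index1SE;            % lambda within one SE of minimum deviance
    selected = find(B(:, idx) ~= 0);   % predictors the lasso keeps
    disp(selected)

With the lasso, the predictors whose coefficients are driven exactly to zero at the chosen lambda are the ones you would flag as irrelevant.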

answered Feb 06 '14 at 21:50


Dan Ryan

Hi Dan,

Thank you for taking the time to respond, I really appreciate it :)

I am actually trying the first method and evaluating the F1 score (I am doing logistic regression) on the train and test sets for every feature I remove (i.e., with 1000 features: remove one feature --> train the logistic regression model --> calculate the F1 score for the train and test sets --> put the feature back and remove the next, and so on).
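Concretely, the loop I have in mind is roughly the sketch below (Xtrain/Xtest, ytrain/ytest and the small f1score helper are just placeholder names):

    % Leave-one-feature-out loop: drop feature j, refit, score, put it back.
    p = size(Xtrain, 2);
    f1Train = zeros(p, 1);
    f1Test  = zeros(p, 1);

    for j = 1:p
        keep = setdiff(1:p, j);                    % all features except j
        mdl  = fitglm(Xtrain(:, keep), ytrain, ...
                      'Distribution', 'binomial'); % logistic regression
        f1Train(j) = f1score(ytrain, predict(mdl, Xtrain(:, keep)) > 0.5);
        f1Test(j)  = f1score(ytest,  predict(mdl, Xtest(:, keep))  > 0.5);
    end

    % Simple F1 helper (assumes 0/1 labels).
    function f1 = f1score(ytrue, ypred)
        tp = sum(ypred == 1 & ytrue == 1);
        fp = sum(ypred == 1 & ytrue == 0);
        fn = sum(ypred == 0 & ytrue == 1);
        precision = tp / max(tp + fp, 1);
        recall    = tp / max(tp + fn, 1);
        f1 = 2 * precision * recall / max(precision + recall, eps);
    end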

Then I am going to compare the F1 scores at the end. My thinking is that when removing a feature makes the F1 score go up (even slightly), that is an indication the feature is irrelevant. Is this a correct assumption?

Regarding the regularisation: I find that with the full feature set I already have a lot of bias. Won't introducing the regularisation parameter result in even more bias?

(Feb 07 '14 at 08:42) Farzan Maghami

As for removing features using the F1 score as a criterion, I think you are right on track.

For the regularization question: yes, regularization usually decreases variance and increases bias, but a small amount could still help if you have a lot of irrelevant features. A small regularization coefficient shouldn't change the overall bias much; it should mainly zero out features with a very weak signal. Also, I think removing features is itself effectively a form of regularization...

(Feb 08 '14 at 17:33) Dan Ryan