Revision history

Revision n. 6
Mar 11 '11 at 10:09
Oscar Täckström

The problem with performing feature selection as a pre-processing step is that you fail to take the interaction between the feature representation and the learning algorithm into account. For example, if you use mutual information as a criterion, you assume conditional independence between features. This is also a problem with forward selection: assume you have features f1, f2, f3, f4 such that f1 and f2 are slightly discriminative, f3 and f4 together are the most discriminative, but f3 or f4 alone are the least discriminative. Then the algorithm will select f1 and f2, but not f3 and f4. Backward elimination handles this better, since you start with all features active and thus take interactions into account, but it is still only locally optimal.
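
To make that failure mode concrete, here is a minimal sketch on hypothetical toy data (assuming numpy and scikit-learn): f3 and f4 are informative only jointly, so a univariate criterion such as mutual information ranks them last, and a greedy forward selector keyed to it will add f1 and f2 first.

    # Toy illustration of the greedy failure mode (made-up data).
    import numpy as np
    from sklearn.feature_selection import mutual_info_classif

    rng = np.random.RandomState(0)
    n = 2000

    # f3 and f4 determine the label only jointly (XOR), so each is useless alone.
    f3 = rng.randint(0, 2, n)
    f4 = rng.randint(0, 2, n)
    y = np.logical_xor(f3, f4).astype(int)

    # f1 and f2 are noisy copies of the label: slightly discriminative on their own.
    f1 = np.where(rng.rand(n) < 0.6, y, 1 - y)
    f2 = np.where(rng.rand(n) < 0.6, y, 1 - y)

    X = np.column_stack([f1, f2, f3, f4])

    # Mutual information ranks f1 and f2 above f3 and f4, so forward selection
    # driven by this criterion adds f1 and f2 first and may stop before the
    # (f3, f4) pair is ever tried together.
    print(mutual_info_classif(X, y, discrete_features=True, random_state=0))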

If you instead use an L2-regularized SVM, you take interactions between features into account and you get a globally optimal solution, though you will need to tune the regularization parameter. Still, you should be careful to only add features that you have good reason to believe will improve the results. As Alexandre points out, if you need to use feature selection, it probably means you added a poorly designed feature in the first place. In the case of small sample sizes and a large number of features, L1-regularization seems to have somewhat stronger theoretical guarantees, but I'm not aware of any practical results that support this in general.
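
A minimal sketch of this alternative, assuming scikit-learn (X and y are placeholders for a training feature matrix and label vector, and the parameter grid is only an example): fit an L2-regularized linear SVM on all features and tune the regularization strength by cross-validation; an L1 penalty can be swapped in when a sparse weight vector is wanted.

    # L2-regularized linear SVM with the regularization parameter C tuned
    # by cross-validation (C is the inverse of the regularization strength).
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import GridSearchCV

    param_grid = {"C": [0.01, 0.1, 1.0, 10.0, 100.0]}
    search = GridSearchCV(LinearSVC(penalty="l2"), param_grid, cv=5)
    search.fit(X, y)  # X, y: placeholder training data
    print(search.best_params_, search.best_score_)

    # For small samples with many features, an L1 penalty gives a sparse
    # weight vector and thus acts as embedded feature selection.
    l1_svm = LinearSVC(penalty="l1", dual=False, C=1.0).fit(X, y)
    print((l1_svm.coef_ != 0).sum(), "non-zero weights")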

Edit: If features are for some reason expensive to obtain, which is common in the medical domain, then feature selection might be very important. However, performing feature selection as a pre-processing step just to reduce the number of words in a bag-of-words (BoW) representation for an SVM classifier does not seem like a good idea to me.
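
For that bag-of-words setting, a hedged sketch of what the last paragraph suggests (assuming scikit-learn; the documents and labels are made-up placeholders): hand the full vocabulary to a regularized linear SVM and let the regularization, rather than a word-selection pre-processing step, control the effective complexity.

    # Full-vocabulary BoW fed directly to a regularized linear SVM
    # (toy documents and labels; no word pre-selection step).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    docs = ["good movie", "bad movie", "great plot", "terrible plot"]
    labels = [1, 0, 1, 0]

    model = make_pipeline(CountVectorizer(), LinearSVC(C=1.0))
    model.fit(docs, labels)
    print(model.predict(["good plot"]))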
