|
A while ago, someone posted a question that sparked a great discussion but never got an answer (aside from the comments), and I think the topic is important enough to deserve its own question: why does removing correlated features have a positive effect in higher dimensions, and why do correlated features work better in lower dimensions? (I already have my own answer, but if anyone has intuitive explanations that I might steal... borrow, that would be great.) Regards
|
With a linear kernel, correlated inputs allow you to filter out independent noise by averaging. However, adding more dimensions by definition leads to the curse of dimensionality - i.e. you need exponential growth in your data to fill the space. Conversely, since you typically have a fixed data set and are free to choose the number of features, adding extra variables makes it more likely that you overfit. So the answer is to do some sort of principal component analysis on your features, to get the benefit of averaging without inducing the overfitting problems of adding too many variables. BTW, the original question seemed bogus - the difference was from 89% to 87%?! That is probably not statistically significant.
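
A minimal sketch of that suggestion, with scikit-learn: project correlated features onto a few principal components before fitting a linear classifier. The data here is synthetic (many noisy copies of one underlying signal), and the number of components kept is an arbitrary illustration, not a recommendation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, p = 200, 50
signal = rng.normal(size=(n, 1))
# 50 noisy, highly correlated copies of a single underlying signal
X = signal + 0.5 * rng.normal(size=(n, p))
y = (signal.ravel() > 0).astype(int)

raw = LogisticRegression(max_iter=1000)
# PCA averages the correlated copies into a few components before the fit
pca_model = make_pipeline(PCA(n_components=5), LogisticRegression(max_iter=1000))

print("raw features:", cross_val_score(raw, X, y, cv=5).mean())
print("PCA features:", cross_val_score(pca_model, X, y, cv=5).mean())
```
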
|
The main problem with correlated features is that they over-emphasize certain aspects of the data. In particular, a lot of machine learning algorithms implicitly assume a Euclidean metric on the feature space, since they use dot products of feature vectors to measure similarity, or equivalently Euclidean distance to measure difference. The extreme case of correlated features is duplicate features, so you can think about the effect of duplicating a feature. This is equivalent to changing the metric on the original data by doubling the weight of the duplicated feature. The exact effect depends on the algorithm: L2-penalized methods will start to favour the duplicated feature, since they can split the weight across the copies and use it at reduced penalty, while k-means clustering will produce clusters that are elongated in the original space. This is not always bad. Sometimes correlated features really do indicate that some aspect of the data is more important, particularly if the methods of obtaining those features are independent. However, it is usually worth trying to decorrelate the data (using something like the SVD), at least to see if it helps.
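
A small sketch of the duplicated-feature thought experiment with ridge regression: splitting a coefficient w across two identical columns costs 2(w/2)^2 = w^2/2 in penalty instead of w^2, so the duplicated feature is shrunk less. The data and penalty strength below are arbitrary illustrations.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 3.0 * x1 + 1.0 * x2 + rng.normal(scale=0.1, size=n)

X_orig = np.column_stack([x1, x2])
X_dup = np.column_stack([x1, x1, x2])   # x1 duplicated

ridge_orig = Ridge(alpha=200.0).fit(X_orig, y)
ridge_dup = Ridge(alpha=200.0).fit(X_dup, y)

print("original coefs:    ", ridge_orig.coef_)
print("with duplicate:    ", ridge_dup.coef_)   # weight on x1 split across copies
# the combined weight on x1 exceeds the single-column fit: less shrinkage
print("sum over x1 copies:", ridge_dup.coef_[0] + ridge_dup.coef_[1])
```
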