Currently I am using an SVM for a classification task. My concern is about the orthogonality of the features that I select manually. Is orthogonality an assumption that must hold before we can apply any machine learning technique? And what happens to classifier performance if some features are not orthogonal?
When you say your features are not orthogonal, I assume you mean that they don't vary independently in your dataset (e.g., they have nonzero covariance). In this case, I'm by no means an SVM expert, but I believe the SVM will perform just fine regardless. If you imagine pictorially what the (non-kernelized) SVM does in the two-dimensional case, you can convince yourself that non-independent features shouldn't be a problem in terms of classifier performance.
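To make that picture concrete, here is a minimal sketch (my addition, not part of the original answer; it assumes numpy and a recent scikit-learn are available): two almost perfectly correlated features, and a linear SVM that still separates the classes without trouble.

    # Sketch: two nearly duplicate features, linear SVM still classifies fine.
    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import cross_val_score

    rng = np.random.RandomState(0)
    n = 400
    x1 = rng.randn(n)
    x2 = x1 + 0.1 * rng.randn(n)            # x2 is almost a copy of x1
    X = np.column_stack([x1, x2])
    y = (x1 + x2 > 0).astype(int)           # label driven by the redundant pair

    print(np.corrcoef(x1, x2)[0, 1])                        # correlation ~0.99
    print(cross_val_score(LinearSVC(), X, y, cv=5).mean())  # accuracy stays near 1.0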
Quick and dirty: calculate all-vs-all correlation coefficients between the features. Kevin's right, though: many methods cope just fine with non-independent variables. Even some that aren't supposed to (Naive Bayes) often do anyway, up to a point at least. If it's a big problem in your data, packages like RapidMiner have various tools for detecting and removing useless features, or for collapsing higher-dimensional data down to lower-dimensional data with composite features.
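As a rough illustration of the all-vs-all check (my sketch, assuming numpy; the 0.95 cutoff is an arbitrary choice):

    import numpy as np

    rng = np.random.RandomState(0)
    X = rng.randn(200, 5)                        # stand-in for your feature matrix
    X[:, 4] = X[:, 0] + 0.05 * rng.randn(200)    # feature 4 is nearly a copy of feature 0

    corr = np.corrcoef(X, rowvar=False)          # feature-by-feature correlation matrix
    i_idx, j_idx = np.triu_indices_from(corr, k=1)
    for i, j in zip(i_idx, j_idx):
        if abs(corr[i, j]) > 0.95:
            print("features %d and %d look redundant (r = %.2f)" % (i, j, corr[i, j]))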
|
With highly nonlinear kernels it might be a good idea to whiten your data (make sure it is centered and has unit variances and zero covariances) before running an SVM classifier, but with linear and mildly nonlinear kernels (like the quadratic or cubic kernel) this is unnecessary. A quick-and-dirty way to whiten your data is to run principal components analysis. Scikits.learn has a fast and easy-to-use implementation of PCA.

Wouldn't PCA reduce the dimensionality of the data? Why not just whiten without doing the projection?
(Jan 19 '11 at 01:35) Leon Palafox ♦

@Leon: You can use PCA to invert the covariance matrix and then whiten the data; also, projecting to reduce dimensionality might be a good idea.
(Jan 19 '11 at 04:20) Alexandre Passos ♦
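A minimal whitening sketch in the spirit of this answer and the comments, keeping every component so the dimensionality is unchanged (the current scikit-learn package layout and its PCA(whiten=True) option are my assumptions):

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.RandomState(0)
    A = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 0.5]])
    X = rng.randn(500, 3) @ A                    # correlated features

    whitener = PCA(whiten=True)                  # no n_components: keep every dimension
    X_white = whitener.fit_transform(X)

    print(np.round(np.cov(X_white, rowvar=False), 2))   # approximately the identity matrix

Setting n_components below the input dimension would additionally project the data, which is the dimensionality reduction Leon asks about.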
Practically speaking, non-orthogonal features will not hurt most machine learning algorithms. However, highly correlated features are usually a problem, especially when the learning algorithm uses the inverse of the feature covariance matrix. For highly correlated features, the covariance matrix will be close to singular, which is very likely to cause numerical issues while optimizing your empirical risk. Note that independence implies uncorrelatedness, but not the other way around (except in the case of jointly Gaussian features). I don't think that the SVM quadratic optimization problem involves inverting the feature covariance matrix, so it should not be affected, but something like logistic regression will definitely run into problems. If you are really interested in checking for pairwise independence of features, I suggest a chi-squared goodness-of-fit test: it essentially checks how far the joint PDF is from the product of the marginal PDFs. Checking for uncorrelatedness is much simpler; you just need to compute the correlation coefficient.
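A quick sketch of both checks (my assumptions: scipy is available, and the chi-squared test is applied to a binned 2-D histogram of two features):

    import numpy as np
    from scipy.stats import chi2_contingency, pearsonr

    rng = np.random.RandomState(0)
    a = rng.randn(1000)
    b = 0.8 * a + 0.6 * rng.randn(1000)          # correlated with a

    # Uncorrelatedness: just the correlation coefficient.
    r, p_corr = pearsonr(a, b)
    print("correlation:", round(r, 3), "p-value:", p_corr)

    # Independence: chi-squared test on a binned joint distribution, i.e. how far
    # the joint PDF is from the product of the marginal PDFs.
    counts, _, _ = np.histogram2d(a, b, bins=4)
    chi2, p_ind, dof, _ = chi2_contingency(counts)
    print("chi-squared:", round(chi2, 1), "p-value:", p_ind)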
Attempting to reduce redundancy in feature space is often an aesthetic pursuit rather than something that is really meaningful in terms of how well your algorithm performs. If you have good features, they are expected to be correlated with the label and hence, to a certain extent, with each other (this isn't strictly true, just in a hand-wavy way). That said, linearly dependent features (or nearly linearly dependent ones) can cause problems for certain algorithms. I tend to remove these features (using a threshold for linear dependence), but not to worry about "highly correlated, but independent".

As a thought experiment (again, hand-wavy intuition building rather than precise theorem proving): you have 100 features. 50 are really just random with respect to the label; the other 50 are very close to each other and to the label. The classification will probably be very good with an RBF SVM. However, if removing "redundant features" leaves you with 50 random features and one good feature, you will probably get very bad classification with an RBF SVM (even feature selection algorithms might mistakenly pick mostly bad features).
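A rough numerical version of that thought experiment (my construction, not the answerer's; it assumes numpy and scikit-learn, and the specific sizes and noise levels are arbitrary):

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    rng = np.random.RandomState(0)
    n = 500
    signal = rng.randn(n)
    y = (signal > 0).astype(int)

    noise = rng.randn(n, 50)                            # 50 features unrelated to the label
    good = signal[:, None] + 0.1 * rng.randn(n, 50)     # 50 near-copies of the signal

    X_redundant = np.hstack([noise, good])              # all 100 features
    X_pruned = np.hstack([noise, good[:, :1]])          # "deduplicated": one good feature left

    for name, X in [("redundant", X_redundant), ("pruned", X_pruned)]:
        acc = cross_val_score(SVC(kernel="rbf"), X, y, cv=5).mean()
        print(name, round(acc, 3))                      # the redundant set usually scores higher

On a run like this the full, redundant feature set typically scores well above the pruned one, which matches the intuition above.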
Some generative methods, like Naive Bayes, assume that features are independent, but discriminative methods like the SVM do not make that assumption. SVMs are used for microarray gene data, which have thousands of highly correlated features.