I've come across different methods for combining features. Most of the time it simply involves concatenating the different features together. Sometimes a more sophisticated method such as Multiple Kernel Learning (MKL) is used to find the optimal weighting of kernel matrices computed from the individual features before the combined kernel matrix is used to train a classifier. I've also seen cases where an ensemble of classifiers is trained to produce a single output, sometimes with the classifiers trained on different features.

Question 1

My question is: when is it better to just concatenate the features into a (possibly) high-dimensional feature vector, and when is it better to train separate classifiers on the different features and combine their outputs?

The reason I ask is that some features come as a "group", e.g. the histogram of oriented gradients (HOG) or the Local Binary Pattern (LBP) histogram commonly used in computer vision. That is to say, individual numbers in a HOG or LBP feature vector do not have much meaning by themselves. Concatenating features with different scales introduces the additional problem of finding a proper scaling method. Furthermore, some features (e.g. HOG and LBP) have huge dimensions, so concatenating them only worsens the problem of working in high-dimensional spaces. On the other hand, it seems that letting a single classifier see all the features might allow it to find a better decision boundary, albeit at the risk of overfitting the data.

Question 2

Ultimately, which of these methods would people here recommend? Are there any rules of thumb, statistical tests, or principled approaches for deciding between them? Ideally I would like to avoid "try all methods and pick the best one"; sometimes there just isn't enough time or resources to do so, and a well-thought-out, justifiable approach is much more satisfying.

I understand this is one of those questions with no single right or wrong answer, but I would like to hear when people here have found the different approaches to work, preferably with an explanation of why. If there are alternatives, feel free to list them as well. Thanks.
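For concreteness, here is a minimal sketch of the two options I have in mind (Python/scikit-learn). The "HOG" and "LBP" matrices are just random stand-ins, since the feature extraction itself is not the point:

```python
# Minimal sketch of early fusion (concatenate scaled feature groups) vs.
# late fusion (one classifier per group, average their probabilities).
# hog_feats / lbp_feats are hypothetical stand-ins for precomputed features.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.RandomState(0)
n = 400
y = rng.randint(0, 2, size=n)
hog_feats = rng.randn(n, 100) + y[:, None] * 0.3   # hypothetical HOG block
lbp_feats = rng.randn(n, 59) + y[:, None] * 0.2    # hypothetical LBP histogram

tr_idx, te_idx = train_test_split(np.arange(n), test_size=0.25, random_state=0)

# Scale each feature group separately (fit on training data only).
scaler_hog = StandardScaler().fit(hog_feats[tr_idx])
scaler_lbp = StandardScaler().fit(lbp_feats[tr_idx])

# --- Early fusion: concatenate the scaled groups, train one classifier ---
X_early_tr = np.hstack([scaler_hog.transform(hog_feats[tr_idx]),
                        scaler_lbp.transform(lbp_feats[tr_idx])])
X_early_te = np.hstack([scaler_hog.transform(hog_feats[te_idx]),
                        scaler_lbp.transform(lbp_feats[te_idx])])
clf_early = SVC(kernel="rbf", probability=True).fit(X_early_tr, y[tr_idx])
print("early fusion acc:", clf_early.score(X_early_te, y[te_idx]))

# --- Late fusion: one classifier per group, average predicted probabilities ---
clf_hog = SVC(kernel="rbf", probability=True).fit(
    scaler_hog.transform(hog_feats[tr_idx]), y[tr_idx])
clf_lbp = SVC(kernel="rbf", probability=True).fit(
    scaler_lbp.transform(lbp_feats[tr_idx]), y[tr_idx])
proba = 0.5 * clf_hog.predict_proba(scaler_hog.transform(hog_feats[te_idx])) \
      + 0.5 * clf_lbp.predict_proba(scaler_lbp.transform(lbp_feats[te_idx]))
late_pred = proba.argmax(axis=1)
print("late fusion acc:", (late_pred == y[te_idx]).mean())
```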
What you're describing is sometimes called "early fusion" vs. "late fusion": early fusion combines the features, late fusion combines the predictions. In general, my experience is that combining classifier predictions works better than appending everything to one feature vector. There is some theory about why ensemble methods work so well; take a look at the Bagging and AdaBoost papers for a discussion of this issue (a minimal usage sketch of both follows the comment below).

I am (somewhat) familiar with Bagging and Boosting, but I guess now is the time to take a closer look. Anyway, thanks for the reply!
(Jul 25 '13 at 04:43)
lightalchemist
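For reference, a minimal usage sketch of the two ensemble methods mentioned above (scikit-learn; the dataset is a synthetic stand-in, not anyone's real features):

```python
# Minimal usage sketch of Bagging and AdaBoost on a synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 40 features, only 10 of them informative.
X, y = make_classification(n_samples=500, n_features=40, n_informative=10,
                           random_state=0)

bagging = BaggingClassifier(n_estimators=50, random_state=0)   # bagged trees
boosting = AdaBoostClassifier(n_estimators=50, random_state=0) # boosted stumps

print("bagging  CV acc:", cross_val_score(bagging, X, y, cv=5).mean())
print("adaboost CV acc:", cross_val_score(boosting, X, y, cv=5).mean())
```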
@Andrew Rosenberg Hi, I think that feeding too many features to a classifier can confuse it, so in that case multiple classifiers can give better results. Can you point me to the theory behind this?

Different classifiers are more or less prone to being "confused" by the inclusion of many features. Decision trees and SVMs are pretty good at ignoring useless features, and L1 and L2 regularization make logistic regression (and other linear classifiers) fairly robust to this kind of noise as well (see the small sketch below). For density-estimating classifiers, look up "the curse of dimensionality" for an understanding of why this happens.
(Jan 13 at 08:20)
Andrew Rosenberg
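A small sketch of the regularization point above (scikit-learn; purely synthetic data, with many irrelevant columns appended to a few informative ones):

```python
# Sketch: L1/L2-regularized logistic regression stays usable even when many
# irrelevant features are appended. Data is purely synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
n = 500
y = rng.randint(0, 2, size=n)
useful = rng.randn(n, 5) + y[:, None]   # 5 informative features
noise = rng.randn(n, 200)               # 200 irrelevant features
X = np.hstack([useful, noise])

for penalty, solver in [("l1", "liblinear"), ("l2", "lbfgs")]:
    clf = LogisticRegression(penalty=penalty, solver=solver, C=0.1, max_iter=1000)
    print(penalty, "CV acc:", cross_val_score(clf, X, y, cv=5).mean().round(3))
```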
Hi, what is your conclusion on the matter? Which method turned out better in your experiments? Thanks.