When comparing two linear classifiers trained on the same training set, and without any other kind of validation on ground-truth test data, can one say that a short list of highly weighted features indicates a better classifier than a long list of low-weight features? Can we make a similar statement about the quality of the training data for a given set of classes? That is, if we have two classes and train the same kind of classifier on training data from each, and the model for one class has a few strong (high-weight) features while the other has many weak (low-weight) features, can we say that one class is well defined by the training data while the other is a poorly defined class?
I don't think there is a simple way to answer this question, and it partly comes down to philosophy. Hopefully the below gives you some food for thought.

Concerning classification (and particularly discriminative classifiers such as linear classifiers), normally one is not interested in characterising the training data but rather in characterising the distinction between the classes. In the SVM you can characterise the quality of the classifier by the size of the margin. More generally, you are optimising some loss function (typically the hinge loss for SVMs) and you characterise the quality of the classifier by this loss. Under this interpretation one would say no, a short list of high weights is not an indication of a better classifier.

Now there is the issue of generalisation. You typically don't really care about performance on the training data but rather on unseen data. The hinge loss of the SVM is more a computational convenience than a true measure of performance. The approach to generalisation in the SVM is to find the maximum-margin hyperplane, that is, the separating plane furthest from each class, under the argument that this leads to better generalisation. More generally one can invoke Occam's razor and prefer simpler classifiers (Bayesianism is a formalisation of this). Under this interpretation one would say yes, a short list of high weights is an indication of a better classifier.

Your answer hides an important point, which is that a short list of high weights, compared with a longer list of small weights, can actually be a good indication of an overfitting classifier (since the l2 norm of the classifier is then very high, which reduces the margin). Just looking at the weight distribution is a bad idea; she should focus on the l2 norm, or something like that, if she's interested in generalization performance (the short sketch after this comment illustrates the point).
(Jul 22 '10 at 07:21)
Alexandre Passos ♦
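To make the norm/margin point concrete, here is a minimal sketch assuming scikit-learn; the synthetic dataset and the two values of C are made up for illustration only. Two linear SVMs can fit the same training set about equally well while having very different weight norms, and it is the norm (margin scales like 1/||w||), not how the weight mass is spread over features, that speaks to generalization.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.svm import LinearSVC

    # Synthetic data: many features, only a few of them informative.
    X, y = make_classification(n_samples=400, n_features=50, n_informative=5, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    for C in (0.01, 100.0):  # weakly vs. strongly regularised fit to the training set
        clf = LinearSVC(C=C, max_iter=20000).fit(X_tr, y_tr)
        w = clf.coef_.ravel()
        # For a (soft-)margin linear SVM the margin scales like 1/||w||,
        # so a larger norm means a smaller margin, regardless of how the
        # weights are distributed across features.
        print(f"C={C}: ||w||={np.linalg.norm(w):.2f}, "
              f"train acc={clf.score(X_tr, y_tr):.2f}, test acc={clf.score(X_te, y_te):.2f}")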
A better thing to do is to look at the norm of the classifier. Just looking at the weighted feature list gives you no indication of generalization ability, and it might fool you both ways: a few highly weighted features might indicate that these features are rare and your classifier is using them to bump a few samples to the correct side, while a lot of features with small weights might just indicate that your classifier has found no real structure in the data, in which case you should not expect good generalization performance from it.

The thing you should use to compare two classifiers with the same error rate is their l2 norm (the l1 norm can also be used, but l1 regularization usually leads to worse generalization than l2 regularization, so let's stick with l2). If the two classifiers have around the same error rate, the one with the smaller l2 norm has, by properties of Euclidean space, the larger margin (this is how support vector machines were first motivated), and the margin is deeply connected to generalization ability. You can even compare the "hardness" of two datasets by the l2 norms of support vector machines trained on both datasets with roughly the same hyperparameters: the dataset for which the SVM needs a higher norm is usually understood to be harder, and you should expect worse generalization on it.

Also, about your justifications: if you have enough features, a short list of them is also expected to separate any training data, and high weights may simply indicate that those features do not separate it immediately, so the weights had to be pushed far from zero for the separation to happen. I guess the ideal thing to look for is a few features with small weights, and only that, but this is understandably harder to measure and find.
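Here is a rough sketch of that hardness comparison, assuming scikit-learn; the two synthetic datasets (easy vs. hard, made by varying class separation) are stand-ins for whatever real datasets you want to compare, and both SVMs share the same hyperparameters.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.svm import LinearSVC

    # Two hypothetical datasets: well-separated classes vs. heavily overlapping classes.
    easy_X, easy_y = make_classification(n_samples=500, n_features=20, class_sep=2.0, random_state=0)
    hard_X, hard_y = make_classification(n_samples=500, n_features=20, class_sep=0.5, random_state=0)

    for name, (X, y) in {"easy": (easy_X, easy_y), "hard": (hard_X, hard_y)}.items():
        clf = LinearSVC(C=1.0, max_iter=20000).fit(X, y)  # same hyperparameters for both datasets
        norm = np.linalg.norm(clf.coef_)
        print(f"{name}: ||w|| = {norm:.2f}  (larger norm -> smaller margin -> 'harder' dataset)")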
A few things that I thought of before asking this question. Two are in favour of a short list of strong features being better than a long list of weak ones: (a) with a long enough list of weak features we can potentially explain any random training data; (b) l1 regularisation encourages sparse models with a few strong features (see the sketch below). One point bothers me in taking this as truth: (c) in AdaBoost, for example, many weak, simple features combined may be better than a few strong, complex features.
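As a small illustration of point (b), here is a sketch assuming scikit-learn; the synthetic data and the choice of C are arbitrary. An l1 penalty tends to produce a sparse weight vector with a few strong features, while an l2 penalty spreads the weight over many features.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=300, n_features=40, n_informative=5, random_state=0)

    for penalty in ("l1", "l2"):
        # liblinear supports both l1 and l2 penalties for logistic regression.
        clf = LogisticRegression(penalty=penalty, C=0.5, solver="liblinear").fit(X, y)
        nonzero = np.sum(np.abs(clf.coef_) > 1e-6)
        print(f"{penalty}: {nonzero} non-zero weights out of {clf.coef_.size}")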