My current problem is to use an SVM on a large set of classes. My individual classes are not going to be too big, maybe at most a few hundred elements each, but I will be dealing with tens of thousands of classes. The usual multi-class SVM methodologies, one-vs-all and all-vs-all, seem prohibitively inefficient for that many classes. What is a good alternative for solving my problem? Preferably I'd like a solution that already comes with code.
Why do you need an SVM? If you use multi-class logistic regression of some sort there should be no problem. If the weight matrix is too large (n inputs, m classes, so that n*m parameters can't be stored), you can factor it as W = UV, where U is n by d and V is d by m, with d < min(n, m).
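A minimal NumPy sketch of that low-rank idea; the sizes, initialization, and the softmax output are illustrative assumptions of mine, not part of the answer above:

```python
# Store W = U @ V instead of the full n_features x n_classes matrix.
# All sizes below are made up for illustration.
import numpy as np

n_features, n_classes, d = 10_000, 5_000, 64  # hypothetical problem sizes

rng = np.random.default_rng(0)
U = rng.normal(scale=0.01, size=(n_features, d))   # n x d factor
V = rng.normal(scale=0.01, size=(d, n_classes))    # d x m factor

def predict_proba(x):
    """Class probabilities via softmax, computed through the factorization."""
    scores = (x @ U) @ V          # never materializes the full n x m matrix W
    scores -= scores.max()        # numerical stability
    p = np.exp(scores)
    return p / p.sum()

x = rng.normal(size=n_features)   # one dense example, for illustration
print(predict_proba(x).shape)     # (5000,)

# Storage: the full W would need n*m = 50M doubles (~0.4 GB); U and V
# together need (n + m) * d = 960k doubles (~7.7 MB).
```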
+1. Look up softmax regression.
(Jun 23 '12 at 04:26)
rm9
You would still need to store an n_classes * n_features parameter matrix.
(Jun 23 '12 at 12:29)
ogrisel
I'm just following the normal procedure for object classification, where each object is represented by a bag-of-words histogram. I thought multi-class SVMs were the standard procedure for this.
(Jun 25 '12 at 14:35)
mugetsu
It is standard, but if both your number of classes and your number of input features (the number of distinct possible words occurring in your bags) are high, then the total size of the one-vs-all global classifier will be large: n_classes * n_features parameters. Assuming one double-precision float per parameter, that means n_classes * n_features * 8 / (1024 ** 2) MB. This might not fit in memory on a single machine. Hence the hashing trick.
(Jun 27 '12 at 08:36)
ogrisel
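To make that arithmetic concrete, here is a tiny back-of-the-envelope check; the class and feature counts are hypothetical placeholders, not numbers from the question:

```python
# Dense one-vs-all parameter matrix size, assuming 8-byte doubles.
n_classes = 20_000    # hypothetical
n_features = 100_000  # hypothetical

size_mb = n_classes * n_features * 8 / (1024 ** 2)
print(f"{size_mb:.0f} MB")  # about 15259 MB, i.e. roughly 15 GB
```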
I agree with ogrisel. This approach doesn't solve the problem if the number of classes is large.
(Jul 05 '12 at 04:28)
Mathieu Blondel
You should have a look at the (label x feature)-hashing trick for multiclass, multilabel and/or multitask learning. Transform your data as follows: choose an arbitrary target dimension d (for instance d = 10^6); then, for each sample i with feature vector x^(i) and label vector y^(i) (labels are expected to be zero-one encoded as an n_classes binary vector, possibly using a sparse representation), generate one hashed example per class by hashing the joint (label, feature) pairs into the d-dimensional space, with a binary target indicating whether that class is active for the sample.
This way you can fit a single linear binary classifier (preferably online, so that you can generate the hashed samples on the fly) with a bounded number of parameters, namely d (+1 for the intercept / bias). The hashing trick also preserves the sparsity of the original representation, making memory-efficient implementations possible. Weinberger et al.'s paper, Feature Hashing for Large Scale Multitask Learning, is a very nice read that provides both theoretical insights and experimental results at scale. If you want an optimized implementation of this scheme, try Vowpal Wabbit by the same authors.
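Below is a rough, self-contained sketch of the (label x feature) hashing idea; the dict-of-counts input format, the hash function, and the training/prediction notes are my own illustrative assumptions, not the exact construction from the paper or from Vowpal Wabbit:

```python
# Illustrative (label x feature) hashing: each (class_id, word) pair is hashed
# into one shared d-dimensional space, so a single binary linear model of
# dimension d serves all classes.
from hashlib import blake2b

d = 2 ** 20  # target dimension, chosen arbitrarily (on the order of 10^6)

def hashed_features(bag_of_words, class_id):
    """Hash the (class_id, word) pairs of one sample into a sparse d-dim vector."""
    v = {}
    for word, count in bag_of_words.items():
        key = f"{class_id}|{word}".encode()
        idx = int.from_bytes(blake2b(key, digest_size=8).digest(), "big") % d
        v[idx] = v.get(idx, 0) + count
    return v  # {hashed_index: count}

# Training: for a sample with true class c, emit hashed_features(x, c) with
# target 1 and hashed_features(x, c') with target 0 for some other classes c',
# then feed the stream to any online binary linear learner.
# Prediction: score hashed_features(x, c) against the learned weight vector
# for every candidate class c and return the argmax.

example = hashed_features({"cat": 2, "whiskers": 1}, class_id=42)
print(len(example), d)
```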
I would consider error-correcting output codes, as they allow you to reduce the original multi-class problem to m binary classification problems (m is chosen by you). In particular, if you set m to a number smaller than the number of classes, you are compressing the problem; if you set it to a larger number, you are adding redundancy. See the multiclass module in scikit-learn for a brief description and references.
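As a quick illustration, scikit-learn's OutputCodeClassifier implements this reduction; the synthetic dataset, the LinearSVC base estimator, and the code_size value below are placeholders of mine:

```python
# Error-correcting output codes with scikit-learn; data and parameters
# are synthetic / illustrative only.
from sklearn.datasets import make_classification
from sklearn.multiclass import OutputCodeClassifier
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=2000, n_features=50, n_informative=30,
                           n_classes=20, random_state=0)

# code_size < 1 compresses (fewer binary problems than classes),
# code_size > 1 adds redundancy.
clf = OutputCodeClassifier(LinearSVC(), code_size=0.5, random_state=0)
clf.fit(X, y)
print(clf.predict(X[:5]))
```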