|
I recently implemented Sequential Minimal Optimization for support vector machines, and now that I am satisfied with its performance on contrived data sets, I'm ready to move up to the real world. My question is what is the simplest kernel for some real world data set? I'm looking at the (very large) list of data sets from UCI's machine learning group, but I'm lost as to which would be appropriate to pick for a simple test run, and what I might use as a kernel. All advice is appreciated. |
|
Linear kernels are surprisingly powerful, often, and allow the SVM to be solved extremely fast. Gaussian kernels are even more powerful, but typically slower for large (n > 10000) datasets. Polynomial kernels are also frequently used, and more powerful than linear kernels. These three should get you started, though for particular applications there are much more complex and powerful kernels (as well as methods for solving SVMs based on them). Right, I understand the various types of kernels, and in fact I combined a number of gaussian kernels centered at different locations to learn a checkerboard pattern. My question is how these kernels are applied to real world data sets, and in particular what kernel is widely known to apply to a specific data set or type of data. My construction was based on the known locations of the centers of the squares of a checkerboard, and that they alternated. However, I have read about kernels applied to text distributions, but those seem very complicated and I'm looking for something simpler to start with.
(Jun 02 '11 at 23:53)
jkun
Maybe I'm surreptitiously asking a deeper question about my misunderstanding of kernel theory: how do I know any of those three kernels will even apply to a particular data set? Excluding parameter selection, there's no reason for any particular representation of feature data to have anything to do with gaussian, linear, or polynomial distributions. What's even worse, I would expect excessive parameter tuning for an unrelated distribution to overfit the training data! How can I reconcile this?
(Jun 03 '11 at 00:05)
jkun
1
First, are you sure you understand what a kernel is? The linear kernel is the dot product of the features of two pieces of data, not between data and a fixed point. This is a crude measure of their similarity. Other kernels are also dot products, but they are dot products of some (often quite complicated) transformation of the features of the two pieces of data they act on. This transformation can capture complex correlations in features of the data. Empirically, gaussian and polynomial kernels perform very very well on a huge number of datasets (especially when those are relatively unstructured - e.g. demographic information in census data, rather than protein shape data) and they are simple to write down. The SVM provides a method that takes this similarity measure and uses it in a theoretically and empirically powerful way (it finds a line in the transformed feature space separating two classes with the maximum possible distance perpendicular from that line to the nearest data point). Kernel space tends to avoid overfitting because of this "maximum margin" line that it finds.
(Jun 03 '11 at 02:08)
Jacob Jensen
Yes, I understand what a kernel is. The whole point of a kernel is to replace the difficulty of the optimization algorithm with the difficulty of finding a kernel which separates the data in feature space. It only makes sense that a good kernel exploits domain knowledge of the problem. Here is, for example, a simple kernel I used to learn the boundary of the unit circle in the plane: K((x1, y1), (x2, y2)) = <(x1, y1, x1^2+y1^2), (x2, y2, x2^2+y2^2)> And this obviously cleanly separates the data via the hyperplane that fixes the third coordinate, and a gaussian kernel would work too. However, if we use the same kernel K to learn the boundary of a circle centered at a different point in the plane, then the convergence of the algorithm is potentially much slower, because we are not incorporating the domain knowledge. My question is, why should I expect a gaussian kernel to work well on a set of real world data, when the underlying distribution could be wildly different? If I get a data set and the linear, gaussian, and polynomial kernels just don't work, am I supposed to throw up my hands in frustration and forget about it?
(Jun 03 '11 at 13:00)
jkun
If you have domain knowledge, then the way to incorporate it is domain specific. An example is object recognition. Pixels are a pretty poor feature space to work on, so you want to transform them and then compare them in a way that recognizes their spatial relationships. The spatial pyramid match kernel does this fairly well: it takes your transformed pixel features (usually something called sift), then compares them between images by comparing the histogram of all feature types, then by breaking the image into quadrants and comparing histograms of features in corresponding quadrants, then break it into a smaller grid and compare, etc. This gives feature comparisons and multiple resolutions and thus takes into account spatial relations without being too restrictive. In a more general domain, if you believe you know a probability distribution your features are generated from, you can fit your model to each sample and find a similarity measure (e.g. based on KL divergence) between the two fitted distributions. Unfortunately, the general recipe is quite broad. My suggestion would be as follows: 1) Try those three simple kernels I named 2) If that doesn't work, find a kernel in the literature for problems similar to yours 3) If that doesn't work try to think of how your data is generated, what the important features and correlations are and how to capture them in a similarity measure, then turn that into a kernel (but keep it simple, for both occam's razor reasons and computational cost) 4) If that doesn't work, try to simplify your problem, apply 1,2, and 3, then try to work your way back to your complicated problem 5) If that doesn't work, ask someone with experience in either kernel methods and/or your domain 6) If that doesn't work, THEN throw up your hands in frustration and forget about it/ try a different method, because even though SVMs are pretty awesome, they aren't appropriate for everything.
(Jun 03 '11 at 13:56)
Jacob Jensen
|
|
Try looking at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/, there are various UCI datasets already formatted for SVMs there and there's various publications describing the performance you should be getting as baseline for those datasets. Great. That's an excellent resource. But it doesn't explain which (if any) kernels are known to work well with a particular dataset.
(Jun 03 '11 at 00:00)
jkun
|
|
Short answer: just use the darn RBF and make sure to look into http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf for model selection. if you then ever have an application-specific problem where you want to get better than that, come back and ask about ideas what might make a good kernel on that specific kinda problem. but you're right, the kernel is the important element to make it truly work well. however, there's no way around expert design here. knowledge has to come in, and the kernel is where it comes in. there is still artistic aspects left in svm usage. background: Gaussian kernels make an SVM a universally consistent learning machine: "Support vector machines are universally consistent, I Steinwart, Journal of Complexity, 2002." In the JMLR-version of Cristianini's kernel-target-alignment paper, there is a section on "no free kernel" -- i've seen a better elaboration on that, but it eludes me where ATM (was it the book chapter version of kta? -- need to think..) Maybe you'll also enjoy "On Relevant Dimensions in Kernel Feature Spaces Mikio L. Braun et al., JMLR 2008". but again, it seems all you want is to use an RBF until you've come across enough works using alternative kernels to judge when which alternatives exist or suggest themselves. |