I originally started with the goal of trying to understand Hilbert Space Embeddings of Hidden Markov Models, but slowly devolved into reading A Spectral Algorithm for Learning Hidden Markov Models and then realized I really don't know what Hilbert Spaces are or how they are used. After a bit more reading, A review of RKHS methods in machine learning and From Zero to Reproducing Kernel Hilbert Spaces in Twelve Pages or Less came to my attention, but even after that I'm afraid I'm still rather unclear. I understand that all of this forms the basis of why we can replace the standard inner product used in R^n with another symmetric, positive definite kernel of our choice, but what exactly goes into these RKHSs? What are their elements? What is the "magic" that allows us to take our input, transform it in some way, and use this new dot product in another space?
This magic is called the representer theorem. Roughly, it states that for some optimization problems, even if you replace dot products between vectors with dot products in a projected Hilbert space (of possibly infinite dimension), then, given some reasonable assumptions on the structure of those dot products (which ensure that there is a projected space where the dot product makes sense), you still get an optimal solution, and it can be written as a combination of kernel evaluations at the training points. A kernel, then, is the function K(x,y) = phi(x) dot phi(y), and as long as it is positive definite you will have no problems. This is the view of kernels in SVMs. It means that, for example, you can keep the exact same structure of the SVM optimization problem (minimize ||w|| + (C/n) sum_i eps_i subject to y_i(w dot x_i + b) >= 1 - eps_i, eps_i >= 0) and get the best solution in a very high-dimensional space.

For some kernels (such as Gaussian kernels), all classes are linearly separable in the projected space (up to sets of measure zero under some reasonable distribution), so the kernel trick allows you to pick the "most regularized" vector that separates any two classes to any degree of accuracy (as measured by C, roughly; see the nu-svc paper for more information on how to make this rigorous). So this means that, with sufficient data, SVMs can produce almost any optimal classifier while keeping optimal generalization performance (according to VC theory). For Gaussian processes, kernels are a bit different, and they are related to the covariance function of the underlying Gaussian process, but the intuition is still that K(x,y) represents some measure of similarity between x and y.

It is slightly tricky, but the main idea is that kernels allow you to get optimal solutions to optimization problems even in infinite-dimensional spaces, for certain classes of regularized loss minimization problems that turn out to be very useful. This is mainly interesting because of the optimality and generalization guarantees, both of which are very important (and if you have ever trained an unregularized neural network to death only to find local optimum after local optimum and in the end still get poor generalization performance, you will understand why having these two characteristics together is tempting; most neural networks today have incorporated these insights). Of course, kernels have their problems; they are shallow and might need a lot of training data to represent highly varying functions (see the Scaling Learning Algorithms Towards AI paper).

Thanks Alexandre, but what I would really like to know, I suppose, is the proof of what you just said. Also, though I see the phrase "infinite-dimensional space" thrown around a lot, I'm not quite sure what that means either. Basically it's saying that instead of characterizing something by a vector in R^n, you need to characterize it as a function, as there are an infinite number of values depending on the "index" you are looking at?
(Aug 07 '10 at 05:19)
Daniel Duckwoth
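A minimal sketch of the answer's main point, in Python with scikit-learn (the make_moons dataset, gamma, and C values below are arbitrary illustrative choices): because of the representer theorem, an RBF SVM's decision function can be recovered purely from kernel evaluations against the support vectors, without ever forming the weight vector w in the infinite-dimensional feature space.

    import numpy as np
    from sklearn.datasets import make_moons
    from sklearn.metrics.pairwise import rbf_kernel
    from sklearn.svm import SVC

    # Illustrative data and hyperparameters (arbitrary choices).
    X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
    gamma, C = 0.5, 1.0

    clf = SVC(kernel="rbf", gamma=gamma, C=C).fit(X, y)

    # Representer theorem in action: the decision function is a finite
    # linear combination of kernel evaluations at the support vectors,
    #   f(x) = sum_i alpha_i y_i K(x_i, x) + b,
    # even though the implicit weight vector w lives in an
    # infinite-dimensional RKHS.
    X_new = np.array([[0.0, 0.5], [1.0, -0.5]])
    K = rbf_kernel(X_new, clf.support_vectors_, gamma=gamma)
    f_manual = K @ clf.dual_coef_.ravel() + clf.intercept_[0]

    print(np.allclose(f_manual, clf.decision_function(X_new)))  # True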
Ok. This Williamson et al paper http://users.rsise.anu.edu.au/~williams/papers/P139.pdf has a proof of a generalization of the representer theorem that you might like.

About infinite dimensionality, let's go by steps. Assume we have two dimensions (x1, x2) and we use a polynomial kernel (1 + x·x')^2. Expanding this expression, we find that K(x,x') = (1 + x1 x'1 + x2 x'2)(1 + x1 x'1 + x2 x'2) = 1 + 2 x1 x'1 + 2 x2 x'2 + x1^2 x'1^2 + 2 x1 x'1 x2 x'2 + x2^2 x'2^2. This is equivalent to defining phi(x) = (1, sqrt(2) x1, sqrt(2) x2, x1^2, sqrt(2) x1 x2, x2^2) and saying that K(x,x') = phi(x) dot phi(x'). So what this specific kernel does is compute a dot product in a space with more dimensions than the original space. This allows the classifier to implicitly learn decision functions that depend on variables not available to a linear classifier, such as x1 x2 or x1^2, without ever having to explicitly represent a weight vector w in the projected space, since the representer theorem guarantees that the solution w to the SVM problem (and similar others) can always be expressed as a linear combination of the training points mapped into the projected space (hence, if w = sum_i alpha_i phi(x_i), then by the reproducing property w dot phi(y) = sum_i alpha_i K(x_i, y)). So you should now understand how it can operate in a space with higher dimensionality than the original.

The nice thing about Hilbert spaces is that you can say that any reasonable kernel (where reasonable is something like positive definite) implicitly defines a projected space. Some calculations show that for the common Gaussian kernel, for example, the dimensionality of the projected space is infinite. So, given enough data, the classifier can learn a weight vector w that depends on an infinite number of combinations and powers of the variables in your training vectors, without ever needing to represent this vector explicitly. This means that there is no limit to what a support vector machine can learn, given enough data, since it can potentially put a weight on almost any possible nonlinear feature of the training vectors, and yet it has guaranteed generalization performance according to VC theory. Things are of course not so rosy, but it is still very interesting and powerful.
(Aug 07 '10 at 07:45)
Alexandre Passos ♦
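A quick numerical check of the expansion above (a throwaway sketch; the two vectors are arbitrary): the explicit feature map phi and the closed-form kernel (1 + x·x')^2 give the same number.

    import numpy as np

    def phi(x):
        # Explicit feature map for the degree-2 polynomial kernel in 2-D:
        # (1 + x.x')^2 = phi(x) . phi(x')
        x1, x2 = x
        return np.array([1.0,
                         np.sqrt(2) * x1,
                         np.sqrt(2) * x2,
                         x1 ** 2,
                         np.sqrt(2) * x1 * x2,
                         x2 ** 2])

    def K(x, xp):
        # Same kernel evaluated without ever building the 6-D vectors.
        return (1.0 + np.dot(x, xp)) ** 2

    x, xp = np.array([0.3, -1.2]), np.array([2.0, 0.7])
    print(np.isclose(phi(x) @ phi(xp), K(x, xp)))  # True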
So if an SVM with Gaussian kernel can learn any nonlinear boundary separating the data, does this mean that the training error would be zero with a Gaussian kernel? I guess not since the optimizer tries to balance the trade-off between the margin and the slack (misclassifications). So maybe some training points may still be misclassified?
(Aug 07 '10 at 11:59)
ebony
Yes, for certain (very high) values of C, the training error will be zero on any non-pathological dataset (unless there are two exactly equal points with different labels, that is). The point of the SVM, as originally proposed, was to fix the training error at zero and maximize the margin, thereby minimizing a bound on the generalization error of the classifier. The soft-margin extension to non-zero training error was added later, to prevent overfitting.
(Aug 07 '10 at 12:04)
Alexandre Passos ♦
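For concreteness, a small experiment along these lines (the dataset, gamma, and C values are arbitrary assumptions): with a very large C the RBF SVM is pushed toward zero training error, while a small C tolerates some misclassified training points in exchange for a larger margin.

    import numpy as np
    from sklearn.datasets import make_moons
    from sklearn.svm import SVC

    # Noisy, overlapping but non-pathological data (no duplicated points
    # with different labels), with arbitrary parameters.
    X, y = make_moons(n_samples=200, noise=0.3, random_state=0)

    for C in (0.1, 1e6):
        clf = SVC(kernel="rbf", gamma=10.0, C=C).fit(X, y)
        train_acc = clf.score(X, y)
        print(f"C={C:g}  training accuracy={train_acc:.3f}")
    # With C=1e6 the training accuracy is (essentially) 1.0; with C=0.1
    # the optimizer prefers a larger margin and accepts some slack.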
@prat Also, if you read the nu-svc paper I linked to in the answer, you will see that a simple reparametrization of the SVM allows you to specify a priori a bound on the fraction of training errors you are willing to accept, and it will find the best classifier within that bound, if such a classifier exists.
(Aug 07 '10 at 12:06)
Alexandre Passos ♦
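A sketch of that reparametrization using scikit-learn's NuSVC (the dataset and parameter values are just for illustration): nu upper-bounds the fraction of margin errors on the training set (and lower-bounds the fraction of support vectors), which is the knob the nu-svc paper makes rigorous.

    import numpy as np
    from sklearn.datasets import make_moons
    from sklearn.svm import NuSVC

    X, y = make_moons(n_samples=200, noise=0.3, random_state=0)

    for nu in (0.05, 0.2, 0.5):
        clf = NuSVC(nu=nu, kernel="rbf", gamma=1.0).fit(X, y)
        train_err = 1.0 - clf.score(X, y)
        frac_sv = len(clf.support_) / len(X)
        # Training error <= fraction of margin errors <= nu,
        # and fraction of support vectors >= nu.
        print(f"nu={nu:.2f}  train error={train_err:.3f}  "
              f"SV fraction={frac_sv:.3f}")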
Also, if we compute the dot product directly, i.e. x·x' = x1 x'1 + x2 x'2, then evaluating the more complex kernel (1 + x·x')^2 amounts to adding 1 to that scalar and squaring it, rather than explicitly building phi(x) = (1, sqrt(2) x1, sqrt(2) x2, x1^2, sqrt(2) x1 x2, x2^2) and taking the dot product in the larger space. Not that Alexandre missed that point, since he was trying to explain the dimensionality issue, but this is also a very important point: most kernels are built from such inexpensive operations. This saves computation time and yet gives us powerful kernels at little extra cost.
(Nov 05 '10 at 01:57)
kpx
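To put a number on that (the dimensions and degree below are arbitrary): evaluating a degree-p polynomial kernel on d-dimensional inputs costs O(d) per pair of points, while the implicit feature space it corresponds to has C(d+p, p) coordinates, which explodes quickly.

    import numpy as np
    from scipy.special import comb

    rng = np.random.default_rng(0)
    d, p, n = 100, 3, 500          # input dimension, kernel degree, #points
    X = rng.standard_normal((n, d))

    # Full kernel matrix via the "inexpensive op": one Gram matrix plus an
    # elementwise power, never touching the projected space.
    K = (1.0 + X @ X.T) ** p

    # Dimension of the implicit feature space (all monomials up to degree p).
    implicit_dim = comb(d + p, p, exact=True)
    print(K.shape, implicit_dim)   # (500, 500) 176851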