As far as I know, there are three ways to try to understand a positive definite kernel, by looking at
My question is: Is there an overview of these three aspects for commonly used kernels? For the linear and polynomial kernel it's pretty simple, I guess. But how about the intersection, chi-squared, Hellinger kernels and exponentiated variants of these? How about tanh and exponential? Does anyone know where to find this kind of information?
Regarding (2), it should always be possible to get a (maybe not very satisfactory) mapping. If you replace, say, the tanh in the tanh kernel with its Taylor expansion, you get a mapping that always works.

As in, every kernel that looks like f(x dot y), for sufficiently nice f, can be written as f(0) + f'(0) (x dot y) + (1/2) f''(0) (x dot y)² + ..., which is an infinite linear combination of polynomial kernels with the coefficients given by the derivatives of f. Maybe there is a way of proving from this that certain many-variable coefficients should be zero, but I doubt it. Either way, this gives a way of understanding these nontrivial kernels by seeing them as mixtures of polynomial kernels: each specific f picks different weights for the polynomial kernels according to how large its derivatives at zero are. You can then understand how the decision boundary should behave by looking at truncations of this Taylor expansion. For example, the sigmoid kernel uses only odd polynomial kernels in its expansion, so it keeps three-way interactions between variables but gives pairwise interactions zero weight (and likewise keeps odd-order interactions while dropping even-order ones). Also, the further you get from zero (that is, the larger the absolute value of x dot y), the more important the nonlinear terms get, so it is effectively a linear kernel unless the dot product between the vectors is big enough. This also suggests that the decision boundary will look very linear in parts of the space that are roughly orthogonal to the support vectors, but can look very complicated in the subspace spanned by the support vectors; indeed, it seems a point could conceivably switch class if you just increase its norm.

Actually, most kernels are not a function of x dot y, right? At the moment I can only think of the linear and tanh kernels. Also, I am not quite sure what the function applied to the dot product tells you about the decision function. In a way you are pointing at something that I overlooked in my question: the norm on the space of functions. I would guess (though I am not sure) that the RKHS for the tanh and rbf kernels are the same but equipped with different norms (both should be the set of smooth functions, shouldn't they?)
(Aug 19 '11 at 11:43)
Andreas Mueller
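To make the "mixture of polynomial kernels" reading above concrete, here is a minimal numerical sketch (Python with numpy). It assumes the bare sigmoid kernel tanh(x dot y), i.e. scale 1 and no offset; the odd-degree Taylor coefficients of tanh are hard-coded and the vectors are arbitrary random choices.

```python
# Minimal check of the "mixture of polynomial kernels" view of the sigmoid
# kernel, assuming the bare form tanh(x . y) (scale 1, no offset).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)
y = rng.normal(size=5)
# keep |x . y| small so the series converges quickly (tanh's radius is pi/2)
x /= 2 * np.linalg.norm(x)
y /= 2 * np.linalg.norm(y)

z = x @ y
# Taylor coefficients of tanh at 0 for degrees 1, 3, 5, 7; even degrees are 0
coeffs = {1: 1.0, 3: -1.0 / 3.0, 5: 2.0 / 15.0, 7: -17.0 / 315.0}
approx = sum(c * z ** deg for deg, c in coeffs.items())

print("tanh kernel       :", np.tanh(z))
print("odd-degree series :", approx)  # the two agree closely here
```

Because all even-degree coefficients of tanh vanish, the truncated sum contains only the odd polynomial kernels, which is exactly the point made above about odd-order interactions.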
Yes, they both map to the set of smooth functions but with different norms. Even the rbf kernel can be written as f(x dot y) if (as usual) x and y are normalized: since ||x - y||**2 = 2 - 2 (x dot y) for unit vectors, it becomes exp(-gamma * (2 - 2 (x dot y))). A polynomial kernel is obviously (1 + x dot y)**k, which is also f(x dot y). The chi-squared kernel, as you say, doesn't fit the mold, although if x and y are normalized it almost reduces to the gaussian kernel, just with the dimensions weighted oddly (I guess the point of it then is to normalize in a more clever way? I've never used it). I'm thinking about the best way to generalize this argument; maybe looking at the Taylor expansion of f(x, y)? It doesn't lead to such a simple expansion in terms of polynomial kernels, but maybe something interesting can still be said.
(Aug 19 '11 at 11:56)
Alexandre Passos ♦
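For the normalization point, a quick numerical check (just a sketch with random unit vectors; gamma is an arbitrary choice):

```python
# Check that the rbf kernel is a function of x . y once the inputs are
# normalized: ||x - y||**2 = 2 - 2 (x . y), so
# exp(-gamma * ||x - y||**2) = exp(-gamma * (2 - 2 (x . y))).
import numpy as np

rng = np.random.default_rng(1)
gamma = 0.5  # arbitrary bandwidth, just for the check
x = rng.normal(size=10); x /= np.linalg.norm(x)
y = rng.normal(size=10); y /= np.linalg.norm(y)

rbf_direct = np.exp(-gamma * np.sum((x - y) ** 2))
rbf_via_dot = np.exp(-gamma * (2 - 2 * (x @ y)))
print(rbf_direct, rbf_via_dot)  # identical up to floating point
```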
Let me be clearer on my earlier point about the decision function. Any decision function is just sum_i alpha_i K(x, v_i) (plus a bias term), for support vectors v_i and dual coefficients alpha_i. If you then insert the Taylor expansion of K above you get sum_i alpha_i (f(0) + f'(0)(x dot v_i) + (1/2)f''(0)(x dot v_i)² + ...). So you can see that, for the sigmoid for example, if x dot v_i is small then the linear term dominates: for points that are far from the support vectors and for points near the origin it's essentially a linear decision function. But as x moves away from the origin in a direction spanned by the support vectors you start seeing nonlinearities, which suggests there is a sweet spot for the norm of x as far as this kernel is concerned (beyond it, terms with very large exponents dominate for no clear reason, which makes for an oversensitive, possibly even wiggly, decision boundary).
(Aug 19 '11 at 14:33)
Alexandre Passos ♦
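A toy sketch of that argument, with made-up support vectors and dual coefficients (V and alpha below are random placeholders, not taken from a fitted SVM):

```python
# Compare the full sigmoid-kernel decision function sum_i alpha_i tanh(x . v_i)
# with its first-order Taylor approximation sum_i alpha_i (x . v_i).
import numpy as np

rng = np.random.default_rng(2)
V = rng.normal(size=(4, 3))    # 4 hypothetical support vectors in R^3
alpha = rng.normal(size=4)     # hypothetical dual coefficients

def decision(x):
    # full decision function (bias omitted for simplicity)
    return float(np.sum(alpha * np.tanh(V @ x)))

def linear_part(x):
    # keep only the f'(0) (x . v_i) terms of the Taylor expansion
    return float(np.sum(alpha * (V @ x)))

x = rng.normal(size=3)
x /= np.linalg.norm(x)
for scale in (0.1, 1.0, 5.0):  # move the test point away from the origin
    print(scale, decision(scale * x), linear_part(scale * x))
# small norm: the two agree, the boundary is essentially linear;
# large norm: the nonlinear terms take over and the approximation fails
```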
Also, it would be interesting to see a similar treatment of nonstandard kernels like string or graph kernels.
Indeed, if, say, string kernels define an inner product between strings, what is the meaning of, for example, a quadratic kernel between strings? Would it help?
Definitely :) I am just pondering some graph kernels. For these, often the embedding is given explicitly.
3) seems a bit hard for graph kernels, since the input space is discrete (if the labels are discrete). One could try to find moves in the space that the decision function can't detect.
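Regarding the quadratic-kernel-between-strings question above, here is a toy sketch using a 2-gram spectrum kernel as the base string kernel (an illustrative choice, not any particular kernel from the literature): squaring the kernel is the same as using pairs of 2-grams as features.

```python
# Toy illustration: what a quadratic kernel on top of a string kernel means.
from collections import Counter

def spectrum(s, k=2):
    # explicit feature map of a toy spectrum kernel: counts of k-grams
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def k_string(s, t):
    fs, ft = spectrum(s), spectrum(t)
    return sum(fs[g] * ft[g] for g in fs)

def k_string_squared_features(s, t):
    # explicit feature map of K**2: one feature per *pair* of k-grams
    fs, ft = spectrum(s), spectrum(t)
    return sum(fs[g] * fs[h] * ft[g] * ft[h] for g in fs for h in fs)

s, t = "abracadabra", "barbara"
print(k_string(s, t) ** 2, k_string_squared_features(s, t))  # identical
```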
A relevant paper (but you probably know it) is "Input space versus feature space in kernel-based methods" by Schölkopf and his colleagues.
Actually, I did not. Though you are right, I probably should ;)