I want to implement an algorithm from a paper that uses kernel SVD to decompose a data matrix, so I have been reading material about kernel methods, kernel PCA, etc. But it is still quite obscure to me, especially the mathematical (linear algebra) proofs and formulae. I have a few very basic questions about kernel-based methods and kernel PCA, since I am a beginner.

  • Why kernel methods? What are their benefits, and what is the intuitive purpose? Is the assumption that a much higher-dimensional space is more realistic for real-world problems and can reveal nonlinear relations in the data, compared with non-kernel methods? According to the material, kernel methods project the data into a high-dimensional feature space, but they need not compute that feature space explicitly; they only need to compute the inner products between the images of all pairs of data points in the feature space. So why project into a higher-dimensional space at all?
  • On the contrary, SVD reduces the feature space. Why do they work in opposite directions? Kernel methods seek a higher dimension, while SVD seeks a lower dimension. To me it sounds odd to combine them. According to the paper I am reading, introducing kernel SVD instead of SVD addresses the sparsity problem in the data and improves results.

asked Apr 21 '14 at 13:24

xiaoyu


One Answer:

As you say, kernel methods transform the data non-linearly into a high-dimensional space, in the hope that the transformed data becomes "linear" (lies close to a linear manifold, or is linearly separable in the case of classification). The kernel trick is a separate issue: it lets you compute the inner products in that feature space cheaply via the kernel function, without ever constructing the high-dimensional feature vectors explicitly. You still get the same inner products, just computed differently.
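For concreteness, here is a minimal sketch (assuming NumPy and an RBF kernel; the toy data and the kernel choice are just for illustration) showing that the Gram matrix of pairwise feature-space inner products can be computed directly from the original data, without building the feature vectors:

    import numpy as np

    def rbf_kernel_matrix(X, gamma=1.0):
        """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2).

        Each entry equals the inner product <phi(x_i), phi(x_j)> in the
        (infinite-dimensional) RBF feature space, yet phi(x) is never formed.
        """
        sq_norms = np.sum(X**2, axis=1)
        sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
        return np.exp(-gamma * sq_dists)

    X = np.random.randn(5, 3)            # 5 points in 3 dimensions (toy data)
    K = rbf_kernel_matrix(X, gamma=0.5)
    print(K.shape)                        # (5, 5): one inner product per pair of points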

The point of kernel PCA is that once you have transformed the data to high dimensions, it might be easily explained by a few directions (principal components). So it is not weird to combine them: you are simply looking for a simple explanation in the non-linear feature space instead of in the linear input space.
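As an illustration (a sketch only, assuming scikit-learn is available; the concentric-circles data is a made-up example, not from the question or the paper), kernel PCA with an RBF kernel can "unfold" data whose structure linear PCA cannot capture:

    import numpy as np
    from sklearn.datasets import make_circles
    from sklearn.decomposition import PCA, KernelPCA

    # Two concentric circles: clear non-linear structure in 2-D input space
    X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

    # Linear PCA only rotates the input space; the circles stay entangled
    X_pca = PCA(n_components=2).fit_transform(X)

    # Kernel PCA performs PCA in the implicit RBF feature space; a few
    # components there often separate the two circles along one axis
    X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)

    print(X_pca.shape, X_kpca.shape)   # both (400, 2), but very different coordinates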

That said, I haven't seen a practical example where kernel PCA gives a clear advantage over plain linear PCA, but there is little harm in trying it: an RBF kernel with decent hyperparameters can capture anything a linear kernel can.

answered Apr 24 '14 at 09:49

digdug
