|
I know that the idea of using a kernel in SVM is to transform the data points into an infinite-dimensional space where they become linearly separable, so that we can then find a maximum-margin separator. But then why do we need a soft margin if we are able to separate all the points? As far as I know, the idea behind the soft margin is that if we cannot fully separate all the points, we find the best possible margin. So if we are using a kernel function, the whole idea of the soft margin makes no sense to me. So what's the idea?
|
It is true that some kernels, like the Gaussian, map the data into an infinite-dimensional space, so any finite data set becomes linearly separable. This is actually one of the reasons for soft margins. The problem is overfitting: by the above argument, linear classifiers in the target space can shatter any finite set of points (their VC dimension is effectively unbounded), so models which rely on perfect separation are unlikely to generalize well. The soft margin, in conjunction with a regularization term, reduces the VC dimension and thus improves generalization. That is the key behind the practical success of SVMs: you can adjust the VC dimension to your problem's complexity. The function of the $C$ or $\nu$ parameter is to control the VC dimension. The $\gamma$ parameter of the Gaussian kernel also affects the VC dimension, so the two are not independent.

Even though the kernel maps data into an infinite-dimensional space, any finite data set spans only a finite-dimensional subspace, and the transformed data points can act as basis vectors for that subspace. In fact, the kernel trick is really a change of basis in which the support vectors are the basis vectors; a trained support vector machine is a covector expressed in terms of the dual basis. The answers here and here may also be of interest.
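To see this interplay concretely, here is a minimal sketch (not from the original answer; it assumes scikit-learn and a synthetic two-moons data set) showing how $C$ and $\gamma$ together control how closely an RBF-kernel SVM fits the training data:

```python
# Sketch: the soft-margin parameter C and the RBF width gamma jointly
# control capacity. Large C with large gamma pushes toward memorizing
# the training set; moderate values trade training error for
# generalization. Data set and parameter values are illustrative only.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for C, gamma in [(1e4, 10.0), (1.0, 1.0), (0.1, 0.5)]:
    clf = SVC(kernel="rbf", C=C, gamma=gamma).fit(X_tr, y_tr)
    print(f"C={C:g}, gamma={gamma:g}: "
          f"train acc={clf.score(X_tr, y_tr):.2f}, "
          f"test acc={clf.score(X_te, y_te):.2f}, "
          f"support vectors={clf.n_support_.sum()}")
```

Typically the high-capacity setting fits the training set almost perfectly but uses many support vectors and does worse on the test set, which is exactly the overfitting the soft margin guards against.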
|
The kernel doesn't guarantee that the classes will be linearly separable in feature space. Real data is noisy and rarely, if ever, perfectly separable.
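As a small illustration of this (again assuming scikit-learn, with a hypothetical toy data set): if two identical inputs carry different labels, which happens with noisy real data, no feature map can separate them, so even a nearly hard margin (a very large $C$) cannot reach zero training error.

```python
# Conflicting labels on identical points: no kernel can separate these,
# so a hard margin is impossible and some slack must be tolerated.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [1.0, 1.0]])
y = np.array([0, 1, 0, 1])  # duplicate inputs, opposite labels

clf = SVC(kernel="rbf", C=1e6).fit(X, y)
print("training accuracy:", clf.score(X, y))  # at most 0.5, never 1.0
```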