In Knowledge Discovery with SVM, maximum-margin classification is introduced and the objective function is given as 1/2 times the (squared) norm of the hyperplane's normal vector. This threw me for a loop, since given a normal vector for a plane we can always obtain a normal vector in the same direction with any norm we like. So at first sight the objective function seems quite meaningless; perhaps it only gains meaning when coupled with the constraints. Still, I'm trying to wrap my head around the idea of optimizing a function of something as arbitrary as the norm of a vector when all we are really considering is the rotation of a hyperplane, for which the norm of the normal vector shouldn't matter.

Any elucidation of this point would be appreciated, perhaps in relation to convex optimization generally, which I'm new to. Is this something that occurs in various contexts, and how should it be approached for better conceptual understanding? This is a fairly fine point, so hopefully it makes sense to someone.

asked May 20 '14 at 18:12

TheOldHag


2 Answers:

No, the norm of the normal vector isn't arbitrary. Its scale is pinned down by a convenient normalization, which makes the objective function easy to write down, as follows:

Given a training set {(x_i, y_i)} with y_i = ±1 for i = 1, ..., N, assume the linear decision boundary has the form <w, x> + b = 0. Then the Euclidean distance from x_i to the boundary is y_i(<w, x_i> + b) / ||w||, where (<w, x_i> + b) / ||w|| is the signed distance.

Note that if we rescale the parameters as w → kw and b → kb (for k > 0), this distance doesn't change. We are therefore free to pick the scale so that y_s(<w, x_s> + b) = 1 for the points x_s closest to the decision boundary (i.e. the support vectors).
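
As a quick sanity check of this scale invariance, here is a minimal sketch with made-up numbers (the vectors and the point are purely illustrative, not part of the derivation above):

    # Numeric check: the signed distance (<w, x> + b) / ||w|| is unchanged
    # when w and b are scaled by the same factor k > 0.
    import numpy as np

    w = np.array([2.0, -1.0])   # hypothetical normal vector
    b = 0.5                     # hypothetical offset
    x = np.array([1.0, 3.0])    # hypothetical training point

    def signed_distance(w, b, x):
        return (np.dot(w, x) + b) / np.linalg.norm(w)

    for k in (1.0, 0.1, 10.0):
        print(k, signed_distance(k * w, k * b, x))   # same value for every k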

The goal now is to maximize the margin 1/||w||, which is equivalent to minimizing ||w||^2 / 2 (the factor 1/2 is added purely for convenience), subject to the constraints y_i(<w, x_i> + b) ≥ 1 for i = 1, ..., N.
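
For concreteness, here is a small sketch of that constrained problem solved numerically. The toy 2-D data and the use of scipy's generic SLSQP solver are my own illustrative choices; a real SVM implementation would typically solve the dual or use a dedicated QP solver.

    # Hard-margin primal sketch: minimize ||w||^2 / 2
    # subject to y_i * (<w, x_i> + b) >= 1 for every training point.
    import numpy as np
    from scipy.optimize import minimize

    # Toy, linearly separable training data (made up for illustration).
    X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
                  [0.0, 0.0], [-0.5, 1.0], [1.0, -0.5]])
    y = np.array([1, 1, 1, -1, -1, -1])

    def objective(params):
        w = params[:2]                # normal vector of the hyperplane
        return 0.5 * np.dot(w, w)     # note: b (params[2]) does not appear here

    constraints = [
        {"type": "ineq",
         "fun": lambda p, xi=xi, yi=yi: yi * (np.dot(p[:2], xi) + p[2]) - 1.0}
        for xi, yi in zip(X, y)
    ]

    result = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
    w, b = result.x[:2], result.x[2]
    print("w =", w, "b =", b, "margin =", 1.0 / np.linalg.norm(w))

Note that the offset b enters only through the constraints, which matches the observation in the comment below that b need not appear in the objective.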

answered May 20 '14 at 22:38

Vinh Khuc

edited May 21 '14 at 11:39

I wasn't seeing that the norm of the normal vector was all about computing the distance to the supporting hyperplanes. For some reason I had that wrapped up in the determination of the offset term b and was focused solely on the rotation of the hyperplane. I guess this explains why b need not even appear in the objective function. Thanks.

(May 21 '14 at 05:55) TheOldHag

I appreciate Vinh's input. This morning I sat down and took another look at the derivation of the original maximum-margin objective, which is essentially minimizing a squared norm, and I think I now have a good intuition for it. The key is that the optimization was flipped from a maximization to a minimization. We minimize the norm subject to constraints coming from all the points in the training data, the support vectors in particular, and those constraints involve dot products of the normal vector with the training points. So while it is true that we could scale the normal vector to any size we want, the constraints, which require each dot product plus offset to exceed a fixed value in the direction of the label, prevent us from shrinking it arbitrarily. Minimizing the norm as far as the constraints allow gives us the maximum margin, with the supporting hyperplanes passing exactly through some of the training points.
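
One way to see the constraints becoming tight is to fit a (nearly) hard-margin linear SVM and inspect the functional margins. This sketch uses scikit-learn's SVC with a large C to approximate the hard-margin problem, on made-up toy data:

    # Sketch: after training, the points reported as support vectors should satisfy
    # y_i * (<w, x_i> + b) ~= 1 (they lie on the supporting hyperplanes), while
    # every other training point gives a value strictly greater than 1.
    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
                  [0.0, 0.0], [-0.5, 1.0], [1.0, -0.5]])
    y = np.array([1, 1, 1, -1, -1, -1])

    clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C approximates a hard margin
    w, b = clf.coef_[0], clf.intercept_[0]

    margins = y * (X @ w + b)
    print("functional margins:", np.round(margins, 3))
    print("support vector indices:", clf.support_)   # these correspond to the ~1.0 entries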

answered May 21 '14 at 10:18

TheOldHag
