Pattern Classification by Duda, Hart, and Stork (Section 9.6.8) states that a 2-class training set of d+1 or fewer samples in a d-dimensional space is always linearly separable, provided the samples span the d dimensions and do not lie in any lower-dimensional subspace. In short, for n <= d+1, where n is the number of samples, there exists a separating hyperplane. How can this be proved?
Ok, so my favourite version of the proof goes like this: the existence of a separating hyperplane is affine invariant. So, without loss of generality, assume x_0 (the first data point) is the origin and x_1,...,x_n are standard basis vectors, with y_0,...,y_n in {-1,+1} the labels. The hyperplane given by w_i = y_i (i = 1,...,n) separates the points x_1,...,x_n according to y. The affine hyperplane given by w*x = -0.5*y_0 separates all the points :) One question remains, though: why did Chris Burges use a much longer proof?
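To see the construction in action, here is a small numerical sanity check (my own sketch with numpy; the variable names and the use of the pseudoinverse are mine, not part of the answer). It builds the hyperplane corresponding to w_i = y_i with offset 0.5*y_0, pulled back to the original coordinates, and verifies that every sample lands on its correct side:

    import numpy as np

    rng = np.random.default_rng(0)
    d, n = 5, 6                        # n = d + 1 samples in d dimensions
    X = rng.normal(size=(n, d))        # rows are samples; almost surely affinely independent
    y = rng.choice([-1, 1], size=n)    # arbitrary +/-1 labels

    x0, y0 = X[0], y[0]                # the point playing the role of x_0 above
    # Choose w so that w.(x_i - x_0) = y_i for i = 1..n-1; the pseudoinverse plays
    # the role of the affine map sending x_0 to the origin and the x_i to e_i.
    w = np.linalg.pinv(X[1:] - x0) @ y[1:]
    b = 0.5 * y0 - w @ x0              # gives w.x_0 + b = 0.5*y_0, too small to flip the others
    print(np.all(np.sign(X @ w + b) == y))   # True

For i >= 1 the decision value is y_i + 0.5*y_0, and for the origin it is 0.5*y_0, so all the signs match the labels.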
I think you need to say that the x_i really are (or can be taken to be) orthogonal unit vectors; otherwise, if for example x_1,...,x_n were all (nearly) the same but had different labels, the proof would not be valid.
(Aug 29 '11 at 10:03)
Daniel Mahler
What does "orthogonal unit vectors" mean? Orthogonal with respect to which scalar product? If I choose the x_i as a basis, I can use the standard scalar product w.r.t. that basis. Then the x_i are orthogonal, right?
(Aug 29 '11 at 10:06)
Andreas Mueller
Basis vectors need not be mutually orthogonal or even of unit length; you need to require that explicitly. Dot products are basis independent, and vector spaces are not required to have one, in which case length and orthogonality are not defined. Also, I think you mean w*x = -0.5*y_0 as the equation for the hyperplane.
(Aug 29 '11 at 12:17)
Daniel Mahler
When I pick a basis, I can always define the standard scalar product w.r.t. that basis, can't I? It's the scalar product that has the identity matrix as its defining matrix with respect to that basis. Then length and orthogonality are defined, and the basis vectors have unit length and are orthogonal.
(Aug 29 '11 at 12:56)
Andreas Mueller
You are right about the sign, I think.
(Aug 29 '11 at 13:32)
Andreas Mueller
Yes, requiring orthogonality is just a formality. You can always define a "standard" product for any basis that makes it orthonormal. Saying "orthogonal" in the assumption just makes explicit that that is the product you are using. Otherwise the assumption is formally consistent with using any other product, which is not sufficient for the proof. Equivalently, you could just say which product you are using.
(Aug 29 '11 at 14:10)
Daniel Mahler
What I was trying to say is that when I say "unit vector", I fix an embedding into R^n. And if I have some fixed embedding and say nothing about which inner product I am using, I think it is clear that the standard product is meant. What you are saying is that I should write "Let (1,0) and (0,1) be orthogonal." I think that is all kinds of weird. And what do you mean by "products tend to get defined before the basis in more abstract/formal treatments"? Everything gets defined when necessary/convenient, afaik. But maybe your background is more abstract than my arithmetic algebraic geometry ;)
(Aug 29 '11 at 14:28)
Andreas Mueller
The proof does not say anything about an embedding in R^n; saying that the e_i are "Cartesian unit vectors" might get you that. I believe "unit vector" just means a vector of unit length and says nothing about products or embeddings. Saying e_i * e_j = delta_ij (the Kronecker delta) is neither weird nor redundant. The bit of philosophy about products being more fundamental than bases is irrelevant, even if in some sense true. I deleted it a couple of minutes after writing it and did not think anybody actually saw it :)
(Aug 29 '11 at 15:40)
Daniel Mahler
Argh... yeah, I meant to say standard basis, not unit vectors; maybe that was a translation slip. That does fix an embedding. I've corrected it. Sorry for the confusion.
(Aug 29 '11 at 15:44)
Andreas Mueller
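For what it's worth, the "standard scalar product with respect to a basis" being debated in this thread can be written down explicitly. A minimal numpy sketch (my own notation, not from the discussion): if the columns of B are the basis vectors, define <u, v>_B := (B^-1 u) . (B^-1 v); the basis is then orthonormal under <.,.>_B by construction.

    import numpy as np

    B = np.array([[2.0, 1.0],
                  [1.0, 3.0]])            # columns form a non-orthogonal basis of R^2
    Binv = np.linalg.inv(B)
    coords = Binv @ B                     # each basis vector expressed in its own coordinates
    gram = coords.T @ coords              # the matrix of <b_i, b_j>_B values
    print(np.allclose(gram, np.eye(2)))   # True: orthonormal under the constructed product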
It's not true for just any set of d+1 samples. They have to be in "general position", a technical term which here means that no d+1 of the points lie on a common hyperplane. There is a simple proof in Chris Burges's tutorial on SVMs for pattern recognition, as Theorem 1 in that paper. The proof involves convex hulls and the fact that, for points in general position, the convex hulls of the two classes do not intersect. I don't really understand why a proof is necessary, though. Can't one just use the data points as a basis and choose w = y? Is there something wrong with this idea?
(Aug 28 '11 at 17:17)
Andreas Mueller
@Andreas: that's the gist of a half-finished proof I have; the problem is that, since the data points are not orthogonal, it's not trivial that it will always work. For example, take the points (10, 1), (15, 1), (2, 1): if you want to put the first two in the positive class and the last one in the negative class and you take w = y, i.e. w = (10 + 15 - 2, 1 + 1 - 1) = (23, 1), then w has a positive dot product with the last vector.
(Aug 28 '11 at 17:21)
Alexandre Passos ♦
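A quick numerical check of this counterexample (my own snippet, just to make the arithmetic explicit):

    import numpy as np

    X = np.array([[10, 1], [15, 1], [2, 1]])
    y = np.array([1, 1, -1])
    w = y @ X                     # the naive label-weighted sum: (23, 1)
    print(np.sign(X @ w))         # [1 1 1]: the third point ends up on the wrong side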
@Andreas: though maybe translating everything so that one of the points is at the origin, and then using w = y for the other points and b = epsilon or -epsilon, should work; but I'm still not convinced.
(Aug 28 '11 at 17:24)
Alexandre Passos ♦
Yeah, I forgot about the origin. But after I posted my answer I thought that doing exactly what you said in your last comment should work. I'll write it down later and see if I spot a problem. Another way of seeing it would be to use one point as the origin and solve wX = Y (where X is the data matrix and Y is the label vector for all points except the one you choose as origin). X has an inverse, so w = Y * X^-1 (then add a bias of y0 * epsilon, where y0 is the label of the origin). Btw, epsilon only has to be smaller than one for it to work.
(Aug 29 '11 at 01:29)
Andreas Mueller
@Andreas: you're right, this works. Write it out in an answer :-)
(Aug 29 '11 at 04:01)
Alexandre Passos ♦
If n <= d, apply a translation that makes all the samples linearly independent, and thus also non-zero; the proposition is only true if such a translation exists, i.e. if the samples are in general position. Working in the translated space, you can consistently define a dot product that makes all the samples orthogonal. Let D = (sum of positives) - (sum of negatives). Then the positives have a positive dot product with D and the negatives a negative one, so the hyperplane through the origin with normal D separates the samples.

If n = d+1, have the initial translation put one of the samples at the origin, which only requires the remaining d samples to be linearly independent, and carry out the proof for those d as before. You end up with a hyperplane that separates the d non-zero samples, with the remaining sample lying on the hyperplane itself. Shifting the hyperplane in the correct direction for that boundary sample, by an amount small enough not to flip any of the other d samples, then separates all d+1 samples.
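To spell out the step "define a dot product that makes all the samples orthogonal", here is one explicit choice (my own notation, assuming the translated samples z_1, ..., z_n are the linearly independent columns of a matrix Z):

    \[
      \langle u, v \rangle \;:=\; u^{\top} (Z Z^{\top})^{+} \, v ,
    \]
    which is a genuine inner product on the span of the samples and satisfies
    \( \langle z_i, z_j \rangle = \delta_{ij} \). With
    \( D = \sum_{i:\,y_i = +1} z_i - \sum_{i:\,y_i = -1} z_i \) this gives
    \( \langle D, z_j \rangle = \sum_i y_i \langle z_i, z_j \rangle = y_j \),
    which is positive exactly for the positive samples, as claimed.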