
The notion of distance between any two points breaks down in high dimensions: the distance from a point to its nearest neighbor converges to its distance to the farthest point. This suggests that distance-based learning algorithms (e.g., KNN classification) might not work as well in high dimensions. I was wondering whether this phenomenon is universal for all high-dimensional datasets, or whether there are high-dimensional datasets that do not exhibit this behavior, on which one can safely apply distance-based learning algorithms such as KNN. I'd appreciate it if someone could point me to relevant literature in this area.
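For concreteness, here is a minimal simulation of the effect (a sketch only, assuming points drawn uniformly from the unit cube and using numpy/scipy; the sample sizes are purely illustrative): the ratio of farthest to nearest distance drifts towards 1 as the dimension grows.

    import numpy as np
    from scipy.spatial.distance import cdist

    rng = np.random.default_rng(0)
    for d in (2, 10, 100, 1000):
        X = rng.random((500, d))               # 500 random points in d dimensions
        dists = cdist(X, X)                    # all pairwise Euclidean distances
        np.fill_diagonal(dists, np.inf)
        nearest = dists.min(axis=1)            # distance to nearest neighbour
        np.fill_diagonal(dists, -np.inf)
        farthest = dists.max(axis=1)           # distance to farthest point
        print(d, (farthest / nearest).mean())  # ratio drifts towards 1 as d grows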

asked Jul 26 '10 at 05:11


spinxl39

edited Nov 04 '10 at 02:57


Joseph Turian ♦♦

What does "distance of a point from its nearest point converges to its distance from the farthest point" mean? Can you explain?

(Sep 07 '11 at 08:21) lpvoid

3 Answers:

KNN is a pain to apply in very high-dimensional spaces, not only because of this distance-concentration property (which means you end up needing exponentially many points to represent even a slightly curved decision surface), but also because computing all pairwise distances is expensive, and most fast indexing structures (KD-trees and the many variations built on top of them) only work well with "few" dimensions: if you have about as many dimensions as data points, or more, they probably won't help much unless they already use dimensionality reduction as a preprocessing step.

But if you have enough dimensions, why not use a (heavily regularized) linear classifier? Also, SVMs with gaussian kernels end up working in a way that is very similar to KNN (instead of computing all pairwise distances, you compute distances with respect to the support vectors; instead of taking the K closest points and voting, you use weighted votes that depend on the distances and on the kernel's sigma parameter), but they tend to work better in very high-dimensional situations because they maximize the margin.
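To make the comparison concrete, here is a rough sketch (assuming scikit-learn and a synthetic dataset where only a few of the many features are informative; all parameter values below are illustrative, not prescriptive):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.neighbors import KNeighborsClassifier

    # Synthetic high-dimensional problem: 500 features, only 10 informative.
    X, y = make_classification(n_samples=2000, n_features=500,
                               n_informative=10, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    models = {
        "logistic regression (L2-regularized)": LogisticRegression(C=0.1, max_iter=1000),
        "SVM with gaussian kernel": SVC(kernel="rbf", gamma="scale"),
        "KNN (k=5)": KNeighborsClassifier(n_neighbors=5),
    }
    for name, model in models.items():
        print(name, model.fit(X_tr, y_tr).score(X_te, y_te))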

As to your real question, however, I can think of many properties of a dataset that will guarantee KNN works well: for example, if your classes are reasonably well modeled by gaussians, if they have little overlap, or if the classes lie well enough in a low-dimensional subspace. But even a single high-variance feature that is uncorrelated with the classes can hurt you, and this is very likely to happen in very high-dimensional settings.

answered Jul 26 '10 at 08:51


Alexandre Passos ♦

edited Jul 26 '10 at 08:56


My question wasn't specifically about classification (e.g., KNN vs. other classification algorithms). I meant any learning algorithm that depends on pairwise distances at some stage, so even clustering algorithms such as k-means might face this problem in high dimensions. I wanted to know whether there are any generic (or even algorithm-specific) suggestions in the literature for addressing such pathological cases.

On a somewhat different note, one of the nice things about KNN is that it can represent non-linear boundaries without having to use kernels. I was wondering whether a kernelized SVM would also face problems in high dimensions, since computing the kernel matrix also requires computing distances. Of course, you probably wouldn't want to kernelize if you are already in a high-dimensional space (please correct me if there are cases where you would still want to kernelize high-dimensional data), but suppose the data is at least "moderately" high dimensional and you want to kernelize your SVM: would the kernel matrix be reliable (and, consequently, the SVM decision boundary, which depends on the pairwise similarities of a test point with the support vectors in the RKHS)?

(Jul 26 '10 at 10:10) spinxl39

Well, it is not exactly true that KNN doesn't use kernels. Since it only depends on distances, you can always start from any kernel matrix and define dist^2(a, b) = K(a, a) + K(b, b) - 2 K(a, b). The same trick works for k-means, by the way. In this sense, both KNN and k-means already use kernels implicitly, since they depend only on pairwise dot products, and many of the preprocessing steps people apply with these methods (centering and normalizing the data, or removing correlations between the dimensions) can be seen as specific kernels.
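Written out, the trick looks like this (a minimal sketch in numpy; the RBF kernel and the toy data are just illustrative choices, and kernel_distances is a made-up helper name):

    import numpy as np

    def rbf_kernel(A, B, gamma=0.1):
        # K(a, b) = exp(-gamma * ||a - b||^2), for all pairs of rows.
        sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
        return np.exp(-gamma * sq)

    def kernel_distances(X_query, X_train, kernel=rbf_kernel):
        # Squared distances in the kernel's feature space:
        # dist^2(a, b) = K(a, a) + K(b, b) - 2 K(a, b)
        K_qq = np.diag(kernel(X_query, X_query))
        K_tt = np.diag(kernel(X_train, X_train))
        return K_qq[:, None] + K_tt[None, :] - 2 * kernel(X_query, X_train)

    # "Kernelized" KNN: rank training points by the kernel-induced distance.
    rng = np.random.default_rng(0)
    X_train, X_query = rng.normal(size=(100, 20)), rng.normal(size=(5, 20))
    neighbours = np.argsort(kernel_distances(X_query, X_train), axis=1)[:, :3]
    print(neighbours)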

What people usually do in very high dimensional cases, I think, is to use some sort of dimensionality reduction. Usually PCA, random projections, truncated SVD, neural network autoencoders, ICA, etc.
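For example, something along these lines (a sketch assuming scikit-learn; the reducers, component counts and data are only illustrative):

    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA
    from sklearn.random_projection import GaussianRandomProjection
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score

    # Toy high-dimensional data: 2000 features, only 20 informative.
    X, y = make_classification(n_samples=1000, n_features=2000,
                               n_informative=20, random_state=0)

    # Reduce to 50 dimensions before running the distance-based method.
    for reducer in (PCA(n_components=50), GaussianRandomProjection(n_components=50)):
        pipe = make_pipeline(reducer, KNeighborsClassifier(n_neighbors=5))
        print(type(reducer).__name__, cross_val_score(pipe, X, y, cv=3).mean())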

About the reliability of the kernel matrix: this is controlled by the fact that the SVM classifier is highly regularized, which limits the complexity of the models you get out of it and helps generalization. One of the main reasons it makes more sense to use an SVM instead of KNN for this sort of problem is that the SVM model will be "simpler", even though it also depends on the pairwise distances. It is hard to do complexity control with KNN (you can only tweak the value of K, which can be problematic if your points are not uniformly spaced), and this makes a hard problem a lot worse. But sometimes even SVMs suffer in very high-dimensional spaces, especially with kernels (it is not so easy to ignore irrelevant features in a projected space, which will make your classifier use far more support vectors than it would otherwise need). For this sort of situation a neural network sounds attractive, since you can learn the "support vectors" and avoid ones (that is, weights of the nodes) that have high values for irrelevant features, which is impossible for SVMs, since the support vectors are constrained to come from the training set.

(Jul 26 '10 at 10:31) Alexandre Passos ♦

So my point, compressed, is that KNN ends up being just like a gaussian SVM (with the sigma parameter tuned so that the variance usually covers about K neighbors) but with poorer support-vector selection and, hence, poorer complexity control; so, in a sense, it is guaranteed to generalize worse.

(Jul 26 '10 at 10:35) Alexandre Passos ♦

Even when faced with very high dimensions, we still expect the data to be highly nonlinear. So to address the curse of dimensionality pointed out in the question, we can first do some cheap (linear) dimensionality reduction (PCA, random projections, ...), and then, to address the nonlinearity, apply a second, nonlinear reduction such as kernel PCA. That's what I would do, but is there an alternative way to cope with this problem? My other concern is that the initial PCA (implicitly) uses the pairwise distances, and if these are blurred in the high-dimensional space, how is PCA to infer a decent subspace? Because of this I would expect a very slow decay of the singular values in high-dimensional spaces.
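(For illustration, the two-stage reduction described above could be sketched like this, assuming scikit-learn; the component counts, the RBF kernel and the final KNN step are all just placeholder choices:)

    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA, KernelPCA
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=1000, n_features=2000,
                               n_informative=20, random_state=0)

    # Cheap linear reduction first, then a nonlinear one, then the classifier.
    pipe = make_pipeline(PCA(n_components=100),
                         KernelPCA(n_components=20, kernel="rbf"),
                         KNeighborsClassifier(n_neighbors=5))
    print(cross_val_score(pipe, X, y, cv=3).mean())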

(Nov 03 '10 at 16:39) Oliver Mitevski

We tackled exactly this question in R.J. Durrant and Ata Kaban, "When Is 'Nearest Neighbour' Meaningful: A Converse Theorem and Implications", Journal of Complexity, 25(4), August 2009, pp. 385-397; a prepublication version can be found by following the link.

Abstract: Beyer et al. gave a sufficient condition for the high dimensional phenomenon known as the concentration of distances. Their work has pinpointed serious problems due to nearest neighbours not being meaningful in high dimensions. Here we establish the converse of their result, in order to answer the question as to when nearest neighbour is still meaningful in arbitrarily high dimensions. We then show for a class of realistic data distributions having non-i.i.d. dimensions, namely the family of linear latent variable models, that the Euclidean distance will not concentrate as long as the amount of ‘relevant’ dimensions grows no slower than the overall data dimensions. This condition is, of course, often not met in practice. After numerically validating our findings, we examine real data situations in two different areas (text-based document collections and gene expression arrays), which suggest that the presence or absence of distance concentration in high dimensional problems plays a role in making the data hard or easy to work with.

Also, you may be interested in my supervisor's recent extension of that work: A. Kaban, "On the Distance Concentration Awareness of Certain Data Reduction Techniques", Pattern Recognition, Vol. 44, Issue 2, Feb 2011, pp. 265-277.

answered Nov 04 '10 at 08:39


Bob Durrant

edited Nov 04 '10 at 09:56

The cosine similarity (which is simply a dot product, i.e. a linear kernel, if the vectors are L2-normalized) usually works well with high-dimensional data (e.g. bags of words). This is because it measures the similarity of two vectors by (the cosine of) the angle they form: two vectors are deemed similar if they point in the same direction, independently of their distance, so two points can be far apart and still point in the same direction. The cosine similarity can be used with KNN, and k-means has been extended to use it too (I think it's called spherical k-means).
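Both uses can be sketched like this (assuming scikit-learn; the toy bag-of-words counts are made up, and plain k-means on L2-normalized rows is only an approximation of spherical k-means, since the centroids themselves are not re-normalized):

    import numpy as np
    from sklearn.preprocessing import normalize
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = rng.poisson(0.3, size=(500, 5000)).astype(float)  # toy bag-of-words counts
    y = rng.integers(0, 2, size=500)                       # toy labels

    # KNN with the cosine distance directly.
    knn = KNeighborsClassifier(n_neighbors=5, metric="cosine").fit(X, y)
    print(knn.score(X, y))                                 # training accuracy, API demo only

    # Approximate spherical k-means: L2-normalize the rows, then ordinary k-means
    # (after normalization, Euclidean distance is a monotone function of cosine).
    labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(normalize(X))
    print(np.bincount(labels))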

Kernelized SVMs also use the notion of pairwise similarity. I prefer them over KNN because they offer a principled way to pick the most informative points (the support vectors): not all points are useful/relevant for classification, and in that sense KNN is overkill. However, in my experience, the linear kernel is really hard to beat on high-dimensional data: if your data is already very high-dimensional, there is no need to project it into an even higher-dimensional space. This tends to confirm that the cosine similarity is a good measure for high-dimensional data.

answered Sep 07 '11 at 12:40


Mathieu Blondel


I agree that cosine works well in high dimensions (at least on text data with a normalized bag-of-words representation). However, it is directly related to the Euclidean distance:

||a - b||^2 = ||a||^2 + ||b||^2 - 2 <a, b> = 2 - 2 * cosine_sim(a, b)   (for L2-normalized a and b)

So the fact that it works so well is probably not due to intrinsic properties of the cosine similarity itself, but to some statistical properties of normalized bags of words that might not generalize to other kinds of high-dimensional datasets (although I would love to be proven wrong).
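(A quick numerical check of that identity, assuming L2-normalized vectors:)

    import numpy as np

    rng = np.random.default_rng(0)
    a, b = rng.normal(size=50), rng.normal(size=50)
    a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)  # L2-normalize
    lhs = np.sum((a - b) ** 2)                           # ||a - b||^2
    rhs = 2 - 2 * np.dot(a, b)                           # 2 - 2 * cosine_sim(a, b)
    print(np.isclose(lhs, rhs))                          # True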

(Sep 07 '11 at 17:37) ogrisel

Actually I just read Bob Durrant's answer and he seems to have pushed this reasoning forward :)

(Sep 07 '11 at 17:42) ogrisel

Good point! However, they are related only if you normalize the vectors beforehand. This shows that the normalization is the key here. In the case of document classification, thanks to normalization, long documents and short documents are treated equally. Actually, many learning algorithms include the norm of the data in their bounds.

(Sep 08 '11 at 01:55) Mathieu Blondel