
How does one find the intrinsic dimensionality of the data? The easy case is when the singular values decay quickly: then we can choose the rank that matches some target accuracy. But what if they decay slowly? Does that mean we can't do much and should change our assumptions? Do we need some nonlinear dimensionality reduction, even though those methods are much more expensive than PCA (SVD by random projections can handle very large matrices)? I suppose we could use PCA as a preprocessing step and then apply a fancier nonlinear method such as an auto-encoder, t-SNE, etc. But aren't we then losing too much (more than just noise) in the preprocessing step, which cannot be recovered afterwards? Suppose that the data matrix is very large.

asked Nov 01 '10 at 22:04


Oliver Mitevski

edited Jan 18 '11 at 22:24

Singular values being similar means that the data variance is similar along all linear projections.

(Nov 01 '10 at 22:11) Yaroslav Bulatov

Yes, that is clear. The singular values are an indication of the shape of the subspace in which the data is embedded: they show the degree of distortion along each dimension. In his talk, Gunnar Martinsson mentioned the power method, but said it requires too many passes over the data matrix. I am not concerned with this constraint for now. Has anyone worked more on the topic? http://videolectures.net/nips09_martinsson_mvll/

(Nov 01 '10 at 22:37) Oliver Mitevski

"Worked on the topic" in what respect? This won't answer your original question at all, but the C++ package redsvd (in-core) and the Python package gensim (out-of-core) implement Martinsson's stochastic SVD algorithm, including the power iterations. And yes, each power iteration means one extra pass over the input.

(Nov 06 '10 at 07:36) Radim

4 Answers:

Finding the intrinsic dimensionality of a data set is a fundamental problem. In fact, in real-world data the intrinsic dimensionality may well be location specific: imagine a Swiss roll and a 3D Gaussian blob embedded in 10-dimensional space. Depending on where on this complex manifold you are looking, the intrinsic dimensionality is two or three. Moreover, it is hard to really put intrinsic dimensionality estimators to the test: for real-world data, ground-truth information on the intrinsic dimensionality is hardly ever available.

Indeed, the simplest way to get an idea of the intrinsic dimensionality is by looking at the eigenspectrum of the covariance matrix. Note that the estimate you obtain using such a method is an upper bound on the true intrinsic dimensionality: e.g., on the Swiss roll data set, you will find three large eigenvalues, whereas the intrinsic dimensionality of the data is two. You can also look at the eigenspectrum of the kernel matrices computed by nonlinear spectral techniques (LLE, Laplacian Eigenmaps, MVU, etc.). These techniques compute the eigenspectrum after embedding the data in some very high-dimensional feature space: the hope is that the data has become (nearly) linear in the feature space, which would lead to a better intrinsic dimensionality estimate.
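As a concrete illustration, here is a small NumPy sketch of that eigenspectrum heuristic (the 95% variance threshold is an arbitrary choice of mine, and the result is only the upper bound discussed above):

    import numpy as np

    def eigenspectrum_dimension(X, var_threshold=0.95):
        """Upper bound on the intrinsic dimensionality: the number of leading
        covariance eigenvalues needed to explain var_threshold of the variance."""
        Xc = X - X.mean(axis=0)
        # Eigenvalues of the covariance are the squared singular values / (n - 1).
        s = np.linalg.svd(Xc, compute_uv=False)
        eigvals = s ** 2 / (len(X) - 1)
        ratio = np.cumsum(eigvals) / eigvals.sum()
        return int(np.searchsorted(ratio, var_threshold) + 1)

On a Swiss roll this would typically return 3 rather than the intrinsic 2, matching the caveat above.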

A variant of the eigenspectrum approach computes a local covariance at each data point and looks at the eigenspectrum of this covariance. In the example with the Swiss roll and the Gaussian blob, such a method would be able to identify the parts of the manifold that are two-dimensional and three-dimensional, respectively.
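A rough sketch of that local variant (the neighbourhood size k and the variance threshold are knobs chosen purely for illustration):

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def local_dimension(X, k=20, var_threshold=0.95):
        """Per-point dimension estimate from the eigenspectrum of the
        covariance of each point's k nearest neighbours."""
        _, idx = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
        dims = np.empty(len(X), dtype=int)
        for i, neighbours in enumerate(idx):
            local = X[neighbours] - X[neighbours].mean(axis=0)
            eigvals = np.linalg.svd(local, compute_uv=False) ** 2
            ratio = np.cumsum(eigvals) / eigvals.sum()
            dims[i] = np.searchsorted(ratio, var_threshold) + 1
        return dims

In the Swiss-roll-plus-blob example, the histogram of these per-point estimates should show one mode around 2 and one around 3.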

Another group of estimators uses the intuition that the number of data points in a ball with radius r that is centered on a data point decreases exponentially with the intrinsic dimensionality (or that the volume of the ball increases exponentially to get the same number of data points inside the ball). Examples of such estimators are the correlation dimension estimator and the nearest neighbor estimator. A similar approach is also taken in the maximum likelihood intrinsic dimension estimator. (Some important comments on that method are given here.)
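For example, a minimal sketch of a correlation-dimension estimate in that spirit (the choice of probe radii is ad hoc and matters quite a bit in practice):

    import numpy as np
    from scipy.spatial.distance import pdist

    def correlation_dimension(X, n_scales=10):
        """The correlation sum C(r) (fraction of point pairs closer than r)
        scales roughly as r^d, so d is the slope of log C(r) versus log r."""
        d = pdist(X)
        # Probe radii between small and moderate quantiles of the pairwise distances.
        radii = np.logspace(np.log10(np.quantile(d, 0.01)),
                            np.log10(np.quantile(d, 0.25)), n_scales)
        C = np.array([np.mean(d < r) for r in radii])
        slope, _ = np.polyfit(np.log(radii), np.log(C), 1)
        return slope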

An alternative method is based on packing numbers; it uses the intuition that, at a fixed scale r, the r-covering number of a data set grows exponentially with the intrinsic dimensionality.

Some techniques for intrinsic dimensionality estimation are implemented in this Matlab toolbox.

answered Nov 02 '10 at 02:12


Laurens van der Maaten

Thank you very much; your answer was very thorough.

One follow-up question, or just discussion. On the MNIST data set I got an exponential drop (with a small exponent, though) up to the 660th dimension, at which point there was a severe drop. The digits have 784 dimensions, which means that even a linear method reduced them from 784 to 660. Not much, but it is definite. However, I am hoping that if the data is heavily sparse, as in vector space models, then this reduction should be much larger.

(Nov 02 '10 at 14:14) Oliver Mitevski

The MNIST digits have quite a few pixels near the border that are (almost) always off, so I'm not surprised that 124 eigenvalues are infinitesimal. Note that the intrinsic dimensionality of the digits is presumably much lower than the 660 you mention (probably on the order of 10 to 30). I once read a study that estimated the intrinsic dimensionality of face space to be ~100.

(Nov 02 '10 at 18:49) Laurens van der Maaten

What about auto-encoders? Any comments on how they fit into this picture?

(Nov 03 '10 at 02:07) Oliver Mitevski

You can think of auto-encoders as a type of non-linear PCA; they need to maximize variance in their low-dimensional representation to minimize reconstruction error. However, it is unclear how one would use them to estimate intrinsic dimensionality; you cannot do spectral analysis.

Perhaps you could train auto-encoders with increasing numbers of hidden units, and look at when the reconstruction error stops rapidly decaying? That would be expensive though, and I'm not convinced it would work very well.
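A crude sketch of that sweep, using scikit-learn's MLPRegressor trained to reproduce its own input as a stand-in autoencoder (layer sizes, widths, and the split are arbitrary choices of mine):

    import numpy as np
    from sklearn.neural_network import MLPRegressor
    from sklearn.model_selection import train_test_split

    def autoencoder_error_curve(X, bottlenecks=(2, 5, 10, 20, 50)):
        """Held-out reconstruction error as a function of bottleneck width;
        look for the width at which the error stops dropping quickly."""
        X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)
        errors = {}
        for k in bottlenecks:
            ae = MLPRegressor(hidden_layer_sizes=(64, k, 64), activation='tanh',
                              max_iter=500, random_state=0)
            ae.fit(X_train, X_train)  # autoencoder: the target equals the input
            errors[k] = np.mean((ae.predict(X_test) - X_test) ** 2)
        return errors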

(Nov 03 '10 at 13:33) Laurens van der Maaten

Isn't there talk these days that data lies on many lower-dimensional manifolds rather than on a single high-dimensional manifold, and hasn't there been definitive work in the machine learning community on algorithms that cater to that? I have been looking for a long time at PCA/spectral/deep-learning techniques for dimensionality reduction, but perhaps someone can shed some light on my question here?

(Nov 03 '10 at 18:57) kpx

@kpx: Yes, there are the approaches called "mixture of subspaces" which make this assumption. If you search for mixture of probabilistic PCA or mixture of factor analyzers, you can find some references.

(Nov 03 '10 at 20:36) spinxl39

Here comes a speculative (maybe totally dumb?) idea. I'm considering NCA (neighborhood component analysis), which learns a transformation A (sticking to the notation in the paper). After learning A with NCA, if we do an SVD on A, can't we find out about the intrinsic dimensionality of the original data set? (A carries a lot of information on the local structure of the data, but also basic global information, so it could strike a good balance.) I haven't had time to try this idea out. Has anyone done something like this, and is there any published work in this direction? Or is this a stupid idea? If so, why? Thanks for any comments.

(Nov 07 '10 at 11:47) osdf

I guess what you refer to is that typical manifold learners assume the manifold to be Riemannian, i.e., (1) to be differentiable and (2) to be locally Euclidean. Both assumptions are problematic when analyzing real-world data.

(1) If your data consists of multiple widely separated clusters the manifold is probably not differentiable everywhere. Instead, it consists of multiple (differentiable) submanifolds. Manifold learners like LLE can only deal with each of these submanifolds separately (i.e., they can only embed a single connected component of the neighborhood graph).

(2) If we assume the manifold of faces is 100-dimensional, can we obtain enough face images to sample the face manifold densely? The amount of data needed for dense sampling grows exponentially with the intrinsic dimensionality of that data; this is sometimes referred to as the "curse of intrinsic dimensionality". For real-world data we typically do not have enough data to sample the manifold at hand densely, as a result of which the locally Euclidean assumption may be problematic.

(Nov 07 '10 at 13:04) Laurens van der Maaten

@Laurens van der Maaten: I've always wondered about this curse of intrinsic dimensionality problem. Have you got references to papers that explore what happens (and what works) when the local Euclidean assumption has to be relaxed or abandoned?

(Nov 07 '10 at 19:55) Alexandre Passos ♦

One option is to set the number of principal components by cross-validation, maximizing the likelihood of left-out data under a probabilistic PCA model (Ref 1); a sketch of this procedure is given after the references below. This is done in the Minka paper on PCA dimensionality choice (Ref 2), in the applications section, although Minka advocates another option. As can be seen in the Minka paper, the likelihood of left-out data often saturates rather than decreasing at high dimensions; in that case, the dimension should be set to the break point of the cross-validation score.

References:

  1. Tipping & Bishop, 'Probabilistic principal component analysis', J Roy Stat Soc 1999 http://eprints.aston.ac.uk/7361/1/NCRG_97_010.pdf

  2. Minka, 'Automatic choice of dimensionality for PCA', NIPS 2001, http://vismod.media.mit.edu/pub/tech-reports/TR-514.pdf
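For concreteness, here is a minimal sketch of that cross-validation procedure; scikit-learn's PCA.score returns the average log-likelihood under the probabilistic PCA model of Ref 1, and the grid of candidate dimensions below is arbitrary:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.model_selection import cross_val_score

    def ppca_cv_dimension(X, max_components=50, cv=5):
        """Cross-validated log-likelihood of held-out data for each candidate
        number of components. Per the caveat above, also check where the curve
        saturates instead of blindly taking the argmax."""
        scores = [cross_val_score(PCA(n_components=n), X, cv=cv).mean()
                  for n in range(1, max_components + 1)]
        return int(np.argmax(scores) + 1), scores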


answered Jan 18 '11 at 17:23


Gael Varoquaux

As probabilistic PCA is a linear-Gaussian model, such an approach will (like looking at the covariance eigenspectrum) overestimate the intrinsic dimensionality of the data when the underlying manifold is non-linear.

(Jan 19 '11 at 05:31) Laurens van der Maaten

If the singular values decay slowly, then there is likely no low-dimensional subspace in the linear sense. You can compare different lower dimensions by comparing the out-of-sample reconstruction error (as in letting some randomly chosen features predict the others), and then check for statistically significant differences in test-set performance. Or, in the fully Bayesian case, you could compare the model evidence of each subspace representation. These methods might be easier to interpret than the singular values, though this may be a contentious point.
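A sketch of the "let some features predict the others" variant with a linear (PCA) subspace; the names and the held-out-feature count are mine, and the returned error is meant to be compared across candidate ranks:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.model_selection import train_test_split

    def heldout_feature_error(X, rank, n_heldout_features=10, seed=0):
        """Fit a rank-k PCA on training rows; on test rows, predict a random
        subset of held-out features from the remaining ones and return the MSE.
        Unlike plain reconstruction error on full test rows, this score does not
        automatically improve as the rank grows."""
        rng = np.random.default_rng(seed)
        X_train, X_test = train_test_split(X, test_size=0.2, random_state=seed)
        missing = rng.choice(X.shape[1], size=n_heldout_features, replace=False)
        observed = np.setdiff1d(np.arange(X.shape[1]), missing)

        pca = PCA(n_components=rank).fit(X_train)
        W, mu = pca.components_, pca.mean_   # W has shape (rank, n_features)
        # Estimate the latent scores from the observed features only ...
        scores = (X_test[:, observed] - mu[observed]) @ np.linalg.pinv(W[:, observed])
        # ... then use them to predict the held-out features.
        X_pred = mu[missing] + scores @ W[:, missing]
        return np.mean((X_pred - X_test[:, missing]) ** 2)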

You can use the reconstruction-error and model-evidence approaches to compare intrinsic dimensions in the nonlinear case as well. However, you are then only comparing intrinsic dimensions within the context of a particular method. In order to infer the intrinsic dimensionality independently of a particular method, you would need a more rigorous definition of what intrinsic dimensionality is.

There is another, indirect way to establish the existence of a lower-dimensional manifold. If you use a data set for a supervised task (e.g. binary classification) and the classifier performs well, then there must be a low-dimensional manifold (only a handful of dimensions). Consider the following synthetic experiment: take the unit cube in D=8 dimensions, divide it into a fine grid, sample data points uniformly in the cube, and randomly assign a label to each grid cell. There is no manifold here, and for any reasonable amount of data you will not be able to predict the labels better than if you ignored the features entirely; the amount of data required grows exponentially in D. Conversely, to predict the labels well, the data must lie on a low-dimensional manifold (D=2 or 3, maybe). So if a supervised method works well in a high-dimensional space, that is evidence of a lower-dimensional manifold. In this regard, Radford Neal has said that D=12 approximately equals infinity.
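A sketch of that synthetic experiment (the grid resolution, sample size, and choice of classifier are illustrative):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    def random_grid_labels(D=8, bins=4, n_samples=5000, seed=0):
        """Uniform points in the unit cube in D dimensions, one random binary
        label per grid cell. With bins**D = 65536 cells and only 5000 points,
        almost every test point falls in a cell the classifier never saw, so
        cross-validated accuracy stays near chance (0.5)."""
        rng = np.random.default_rng(seed)
        X = rng.random((n_samples, D))
        cell_labels = rng.integers(0, 2, size=bins ** D)
        cell_index = np.ravel_multi_index(
            tuple((X * bins).astype(int).clip(0, bins - 1).T), (bins,) * D)
        y = cell_labels[cell_index]
        clf = RandomForestClassifier(n_estimators=100, random_state=seed)
        return cross_val_score(clf, X, y, cv=5).mean()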

answered Dec 29 '10 at 19:54


Ryan Turner

Interesting comment! What do you mean by "Radford Neal has said"? Did he say that in a lecture or a talk, or has he written it down somewhere?

(Dec 30 '10 at 04:34) osdf

If you are willing to be Bayesian, you could use an automatic relevance determination (ARD) prior on the projection matrix (see the Bayesian PCA paper). Alternatively, you could put an Indian Buffet Process prior on the projection matrix, along the lines of the infinite factor analysis paper, which assumes the number of subspace dimensions to be unbounded and lets the data determine the actual number automatically (factor analysis is similar to probabilistic PCA; each factor is akin to a subspace dimension).

answered Nov 03 '10 at 20:06


spinxl39


If you want to go even one step further, check out a recent publication out of Lawrence Carin's lab: "Compressive sensing on manifolds using a nonparametric mixture of factor analyzers: algorithm and performance bounds". The paper describes how to find the number of mixture components and the latent dimensions of an MFA in a principled way.

(Nov 06 '10 at 05:10) osdf

