I stumble upon t-SNE frequently these days. What are the main advantages of t-SNE compared to classical dimensionality reduction methods (say, PCA)? What is the objective of t-SNE?
t-SNE is a non-linear dimensionality reduction technique targeted at the 2D or 3D visualization of high-dimensional data. It achieves this by trying to respect the local structure while not trying to reproduce the global structure: points that are close in the original high-dimensional space should have their pairwise distances preserved in the low-dimensional t-SNE embedding, while points that are far apart in the original space can be projected to completely unrelated locations in the 2D embedding. If, with PCA, you only keep the first two eigenvectors as a linear projection basis for visualization, you lose most of the variance, and there is no guarantee that the local structure will be preserved any better than the global structure. For instance, on the MNIST dataset, PCA will never give you a 2D projection that looks like the result on page 18 of the original t-SNE paper. In fact, the paper on the parametric variant of t-SNE (which uses a deep belief network to learn the non-linear projection function) shows a comparison on page 7 of 2D visualizations of MNIST using PCA and parametric t-SNE.
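If you want to reproduce that kind of side-by-side picture yourself, here is a minimal sketch assuming scikit-learn and matplotlib, using the small bundled digits dataset as a stand-in for full MNIST (so the contrast will be less dramatic than in the papers):

    # Sketch: 2D embeddings of the scikit-learn digits data with PCA vs t-SNE.
    # Assumes scikit-learn and matplotlib; the digits set is a small MNIST stand-in.
    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE
    import matplotlib.pyplot as plt

    X, y = load_digits(return_X_y=True)           # 1797 samples, 64 features

    X_pca = PCA(n_components=2).fit_transform(X)  # linear projection onto top 2 PCs
    X_tsne = TSNE(n_components=2, perplexity=30,
                  random_state=0).fit_transform(X)  # non-linear 2D embedding

    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    for ax, emb, title in zip(axes, (X_pca, X_tsne), ("PCA", "t-SNE")):
        ax.scatter(emb[:, 0], emb[:, 1], c=y, s=5, cmap="tab10")
        ax.set_title(title)
    plt.show()

The class structure that PCA tends to smear together usually comes out as well-separated clusters under t-SNE, which is exactly the local-over-global trade-off described above.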
t-SNE tries to find a non-linear mapping between the high-dimensional space and the low-dimensional space that preserves distances between pairs of points. By comparison, PCA can only find a linear mapping, and its objective is to find a subspace in which the projection of each data point lies as close to the original point as possible. This means that clusters which PCA would merge in its projection can remain separate in the low-dimensional space produced by t-SNE. One obvious con of t-SNE relative to PCA is the cost of applying it: for a large dataset it may take several hours to run, while PCA is usually more or less instantaneous on modern hardware, even for fairly large datasets. Also, I don't think any algorithm is guaranteed to find the global minimum of the t-SNE objective, so it doesn't define a unique visualization of the data, which might matter for some applications. One con of t-SNE relative to other nonlinear dimensionality reduction methods is that it is mostly justified for very low-dimensional spaces, like 2 or 3 dimensions, so it is more suited for visualization than for learning features to feed into a pattern recognition algorithm. Another thing you may want to look into is whether t-SNE focuses more on preserving the local or the global structure of the distribution; different nonlinear dimensionality reduction algorithms emphasize one over the other. I would guess that t-SNE does more to preserve the local structure, but with the heavy tails I'm not sure.
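To make the "no unique visualization" point concrete, here is a small sketch, assuming scikit-learn's TSNE and SciPy: two runs that differ only in the random seed typically end up in different local minima, with visibly different layouts and somewhat different final KL values.

    # Sketch: two t-SNE runs on the same data, differing only in the random seed.
    # Assumes scikit-learn and SciPy; the layouts usually differ beyond a rigid
    # transform, illustrating that the non-convex objective has many local minima.
    from sklearn.datasets import load_digits
    from sklearn.manifold import TSNE
    from scipy.spatial import procrustes

    X, _ = load_digits(return_X_y=True)

    embeddings = {}
    for seed in (0, 1):
        tsne = TSNE(n_components=2, perplexity=30, random_state=seed)
        embeddings[seed] = tsne.fit_transform(X)
        print("seed", seed, "final KL divergence:", round(tsne.kl_divergence_, 3))

    # align the two maps with a rigid Procrustes transform before comparing them;
    # the residual disparity is usually non-trivial, i.e. the runs found different minima
    _, _, disparity = procrustes(embeddings[0], embeddings[1])
    print("Procrustes disparity between the two runs:", disparity)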
t-SNE is a non-linear (specialized) dimensionality reduction technique used for visualizing high-dimensional data in low dimensions (2D or 3D). It is actually a follow-up to an earlier technique called SNE (stochastic neighbor embedding), which uses the idea that the conditional probability of a point i being close to a point j should stay the same in both the original high-dimensional space and the low-dimensional embedding. t-SNE improved on SNE by preserving joint probabilities instead of conditional probabilities, which gives the symmetry property for pairwise probabilities. Another improvement of t-SNE over SNE is that t-SNE uses a heavy-tailed distribution centered at each point to define the neighborhood in the embedding, in contrast to SNE, which uses Gaussians at each point. The use of a heavy-tailed distribution avoids the "crowding problem" encountered by other 2D/3D projection techniques: roughly speaking, it gives "enough room" to distant pairs of points so that they can be embedded reasonably in the 2D/3D space. In terms of cons, t-SNE is not a general dimensionality reduction technique that should be used as a preprocessing step for other learning tasks such as high-dimensional classification or clustering, so it is typically used only as a visualization method. Another downside is that the optimization problem posed by t-SNE is non-convex. For a comparison with other approaches and more details of the method, you can read the original t-SNE paper.
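To spell out the objective asked about in the question: t-SNE minimizes the Kullback-Leibler divergence KL(P||Q) between the Gaussian-based joint probabilities P computed in the high-dimensional space and the Student-t (one degree of freedom, i.e. 1/(1+d^2)) joint probabilities Q computed in the embedding. Below is a minimal NumPy sketch of these quantities; as a simplification it uses one global Gaussian bandwidth instead of the per-point bandwidths that the real algorithm calibrates to a target perplexity.

    # Sketch of the t-SNE objective KL(P || Q) for a given embedding Y of data X.
    # Simplification (assumption): a single global Gaussian bandwidth sigma instead
    # of the per-point bandwidths t-SNE fits to a target perplexity.
    import numpy as np

    def squared_dists(Z):
        # pairwise squared Euclidean distances
        s = np.sum(Z**2, axis=1)
        return s[:, None] + s[None, :] - 2 * Z @ Z.T

    def p_joint(X, sigma=1.0):
        # high-dim affinities: Gaussian conditionals, then symmetrized joints
        D = squared_dists(X)
        P = np.exp(-D / (2 * sigma**2))
        np.fill_diagonal(P, 0.0)
        P /= P.sum(axis=1, keepdims=True)    # conditional p_{j|i}
        return (P + P.T) / (2 * len(X))      # joint p_{ij}

    def q_joint(Y):
        # low-dim affinities: heavy-tailed Student t with one degree of freedom
        Q = 1.0 / (1.0 + squared_dists(Y))
        np.fill_diagonal(Q, 0.0)
        return Q / Q.sum()

    def kl_objective(P, Q, eps=1e-12):
        return np.sum(P * np.log((P + eps) / (Q + eps)))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 20))           # toy high-dimensional data
    Y = rng.normal(size=(100, 2)) * 1e-2     # toy 2D embedding (small random init)
    print("KL(P||Q) =", kl_objective(p_joint(X), q_joint(Y)))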
Actually symmetric SNE predates t-SNE; t-SNE's main contribution, as you mention, was the heavy tails, which appear to be important to getting good behaviour on both smaller and larger scales.
(Oct 12 '10 at 18:12)
David Warde Farley ♦
I think the pros of t-SNE have already been appropriately discussed above. With respect to the cons, I would like to remark on the following. Indeed, the standard version of t-SNE is very well suited for embedding in two or three dimensions (i.e., visualization), but not for embedding in spaces of higher dimensionality. If you embed in, say, a 30-dimensional space, the crowding problem becomes less severe, so you don't need the heavy tails of the Student t-distribution. If you want to use t-SNE to embed in higher dimensions, you should increase the number of degrees of freedom of the Student t-distribution. This decreases the thickness of the tails and allows t-SNE to fully exploit all the space that is available in 30 dimensions. Strategies for adapting the degrees of freedom of the t-distribution are discussed in the paper on the parametric variant of t-SNE.
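As a concrete illustration of the kernel being adapted here (a sketch, not code from the paper): with alpha degrees of freedom the low-dimensional similarity is proportional to (1 + d^2/alpha)^(-(alpha+1)/2), so alpha = 1 gives the heavy 1/(1+d^2) tails of standard t-SNE, and larger alpha gives lighter, more Gaussian-like tails suited to higher-dimensional embeddings.

    # Sketch: how the Student-t tail thickness changes with the degrees of freedom.
    # alpha = 1 recovers the 1/(1 + d^2) kernel of standard t-SNE; as alpha grows,
    # the kernel decays nearly as fast as a Gaussian, which is what you want when
    # embedding into higher-dimensional spaces where crowding is milder.
    import numpy as np

    def student_t_kernel(d, alpha):
        # unnormalized similarity as a function of embedding distance d
        return (1.0 + d**2 / alpha) ** (-(alpha + 1) / 2.0)

    d = np.array([0.5, 2.0, 5.0, 10.0])
    for alpha in (1, 5, 30):
        print(f"alpha={alpha:>2}:", np.round(student_t_kernel(d, alpha), 5))
    # The alpha=1 row decays only polynomially (heavy tails, plenty of "room" for
    # far-apart points); the alpha=30 row is already close to a Gaussian.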
In addition to the pros and cons mentioned, I'll throw in one more con (and really, this is more of a theoretical con than a practical one): optimizing the objective function in clever ways can be a real pain. The t-SNE paper outlines the algorithm in terms of gradient descent with momentum, and I seem to recall that the code incorporates the delta-bar-delta scheme for parameter-specific gain adaptation (Jacobs et al, 1988). It turns out it's rather hard to beat these old standbys with ready-made optimizers like conjugate gradient, BFGS, etc. (I played with this for a course project, and even using the Hessian didn't seem to help!). My best guess is that there are a lot of really bad fixed points of the objective function (not surprising since the number of tuneable parameters scales linearly with the number of points visualized) and you really need at least the momentum term to escape from them and get into a relatively good region of parameter space. EDIT: Laurens van der Maaten (original author of t-SNE) responds in a comment below with a clever insight into a search direction that works much better than the stuff I tried.
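For readers who have not seen it, here is a hedged sketch of the kind of update rule described above: plain gradient descent with momentum plus per-coordinate gain adaptation in the spirit of delta-bar-delta. The constants follow common choices in published reference code but should be treated as assumptions, and grad_fn below is a hypothetical placeholder for whatever computes the t-SNE gradient.

    # Sketch of a t-SNE-style parameter update: gradient descent with momentum
    # plus per-coordinate gain adaptation (a delta-bar-delta-style heuristic).
    import numpy as np

    def update_step(Y, grad, velocity, gains,
                    eta=500.0, momentum=0.8, min_gain=0.01):
        # grow the gain where the new gradient and the current velocity disagree
        # in sign (the descent direction is consistent), shrink it where they agree
        same_sign = np.sign(grad) == np.sign(velocity)
        gains = np.where(same_sign, gains * 0.8, gains + 0.2)
        gains = np.maximum(gains, min_gain)

        velocity = momentum * velocity - eta * gains * grad   # momentum step
        return Y + velocity, velocity, gains

    # hypothetical usage inside the optimization loop (grad_fn is a placeholder;
    # velocity starts at zero and gains at one):
    #
    #   Y, velocity, gains = Y0, np.zeros_like(Y0), np.ones_like(Y0)
    #   for it in range(1000):
    #       Y, velocity, gains = update_step(Y, grad_fn(Y), velocity, gains,
    #                                        momentum=0.5 if it < 250 else 0.8)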
Actually, there is an optimization algorithm that converges much faster. To explain how it works, I will first rewrite the t-SNE gradient a bit. Denoting the low-dimensional embedding by an N x D matrix Y, the graph Laplacian of the high-D similarity matrix P by LP, and the graph Laplacian of the low-D similarity matrix Q by LQ, you can write the t-SNE gradient as 4 * (LP - LQ) * Y. (Btw, this way of writing down the gradient facilitates a fast implementation of t-SNE in CUDA.) If you set the t-SNE gradient to zero, you can investigate various splits in an attempt to find a fixed-point iteration. Unfortunately, none of the splits leads to an update rule that is guaranteed to converge (i.e., you don't get a fixed-point iteration). However, one of the splits does suggest a search direction that works very well in practice, viz., Y - (LP \ (LQ * -Y)). You can prove that this search direction never becomes orthogonal to the t-SNE gradient. By Zoutendijk's theorem, this implies that you are guaranteed to converge to a local minimum if you use the above search direction with a line search that satisfies the Wolfe conditions. In practice, you can use the above search direction to construct an optimizer that produces good results in as few as 15 parameter updates. The solutions are not better than those of the gradient method, but the computational gains are significant. I'll try to post some code later this week.
(Oct 13 '10 at 13:04)
Laurens van der Maaten
Laurens -- that's really good to hear, thanks for sharing! I never really thought about it in that light or played much with the functional form of the gradient. I'm eager to try it out when I get some time.
(Oct 13 '10 at 13:59)
David Warde Farley ♦
There are two small corrections to my comment above. The backslash in the search direction did not parse correctly when I first posted it; the intended direction is Y - (LP \ (LQ * -Y)). Also, LP should be the graph Laplacian of P .* (1 ./ (1 + D.^2)) and LQ the graph Laplacian of Q .* (1 ./ (1 + D.^2)), where D is the matrix of pairwise Euclidean distances in the map and .* denotes the element-wise product.
(Oct 13 '10 at 18:10)
Laurens van der Maaten
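Putting the comment and its correction together, here is a hedged NumPy sketch of the Laplacian form of the gradient and of the suggested search direction. The small ridge term is an added assumption so that the solve against the singular graph Laplacian is well-posed; it is not part of the scheme described above.

    # Sketch: t-SNE gradient in graph-Laplacian form, 4 * (LP - LQ) * Y, and the
    # search direction Y - LP \ (LQ * -Y) described in the comments above.
    import numpy as np

    def graph_laplacian(W):
        # L = diag(row sums) - W, so (L @ Y)[i] = sum_j W[i, j] * (Y[i] - Y[j])
        return np.diag(W.sum(axis=1)) - W

    def tsne_gradient_and_direction(P, Q, Y, ridge=1e-6):
        D2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)  # squared map distances
        W = 1.0 / (1.0 + D2)                # the 1/(1 + D^2) weighting
        LP = graph_laplacian(P * W)         # Laplacian of P .* 1/(1 + D^2)
        LQ = graph_laplacian(Q * W)         # Laplacian of Q .* 1/(1 + D^2)

        grad = 4.0 * (LP - LQ) @ Y          # the t-SNE gradient

        # search direction Y - LP \ (LQ * -Y); the ridge is an assumption, added
        # only because a graph Laplacian is singular and cannot be solved directly
        n = len(Y)
        direction = Y - np.linalg.solve(LP + ridge * np.eye(n), LQ @ (-Y))
        return grad, direction

    # toy demonstration with random inputs, just to exercise the shapes
    rng = np.random.default_rng(0)
    Y = rng.normal(size=(50, 2))
    P = rng.random((50, 50))
    P = P + P.T
    np.fill_diagonal(P, 0.0)
    P /= P.sum()
    D2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    Q = 1.0 / (1.0 + D2)
    np.fill_diagonal(Q, 0.0)
    Q /= Q.sum()
    grad, direction = tsne_gradient_and_direction(P, Q, Y)
    print(grad.shape, direction.shape)

As described above, you would combine `direction` with a line search satisfying the Wolfe conditions rather than taking raw gradient steps.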
There is some code online for the t-SNE optimizer discussed in this thread (under "New t-SNE Optimizer").
(May 04 '11 at 10:49)
Laurens van der Maaten