
I'm looking for a dimensionality reduction tool for visualization:

I have N data points, each of which has a k-dimensional vector of feature values. What are some good techniques and tools to reduce the feature dimensionality down to 2 features, so that all N data points can be nicely drawn on a map?

asked May 04 '11 at 03:24

Frank

edited May 04 '11 at 06:43

ogrisel


Try PCA first, before anything else. If your data is simpler than you think, it might do a very good job of mapping it. It's also about 2 lines in MATLAB, or in any machine learning library.

(May 05 '11 at 00:55) Jacob Jensen
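For reference, a minimal sketch of that two-liner in Python with scikit-learn (X here is a placeholder for your own N x k array):

    # Minimal PCA-to-2D sketch; X stands in for your (N, k) data.
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    X = np.random.rand(500, 50)                   # placeholder data
    X_2d = PCA(n_components=2).fit_transform(X)   # project onto first 2 components
    plt.scatter(X_2d[:, 0], X_2d[:, 1])
    plt.show()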

But someone here said that PCA is only good if I have roughly <10-dimensional feature vectors. Mine are pretty large and can be on the order of thousands or even a million.

(May 05 '11 at 03:42) Frank

If they are highly correlated this might still give you an interesting viz. If you have a million features, you need a PCA lib that supports sparse inputs, such as this implementation.

(May 05 '11 at 05:51) ogrisel
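A sketch of that sparse-friendly route, assuming scikit-learn's TruncatedSVD (which accepts scipy.sparse inputs) as the PCA-like step:

    # Sketch: PCA-style 2D projection for very wide, sparse data.
    # TruncatedSVD works on scipy.sparse matrices directly (it skips the
    # mean-centering step that would make an exact PCA dense).
    import scipy.sparse as sp
    from sklearn.decomposition import TruncatedSVD

    X = sp.random(1000, 1000000, density=1e-5, format='csr')  # placeholder data
    X_2d = TruncatedSVD(n_components=2).fit_transform(X)
    print(X_2d.shape)  # (1000, 2)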

PCA gives you the fraction of variance explained by the first n eigenvectors, e.g. cumulative % of total variance: [64.6, 81.7, 89.4, ...], so look at this curve for your data. (1000 raw? features sounds noisy, though.)

(May 09 '11 at 07:13) denis
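That curve is a one-liner in scikit-learn (a sketch, again with placeholder data):

    # Sketch: cumulative % of variance explained by the first n components.
    import numpy as np
    from sklearn.decomposition import PCA

    X = np.random.rand(500, 50)  # placeholder data
    cumvar = 100 * np.cumsum(PCA().fit(X).explained_variance_ratio_)
    print(cumvar[:5])            # e.g. something like [64.6 81.7 89.4 ...]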

6 Answers:

If k is not very high (say, less than 10) or the features are highly correlated, a simple projection onto the first 2 components of a PCA might do.

Otherwise you can try Multi-Dimensional Scaling (MDS) or t-distributed Stochastic Neighbor Embedding (t-SNE).

If you expect your data to live on a single fully connected manifold, Locally Linear Embedding (or Hessian-LLE) might work well.

answered May 04 '11 at 03:43

ogrisel
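All three suggestions have scikit-learn counterparts; a minimal sketch, assuming a scikit-learn version that includes TSNE (the data is a placeholder):

    # Sketch: MDS, t-SNE and (Hessian-)LLE via scikit-learn's manifold module.
    import numpy as np
    from sklearn.manifold import MDS, TSNE, LocallyLinearEmbedding

    X = np.random.rand(300, 50)  # placeholder (N, k) data

    mds_2d  = MDS(n_components=2).fit_transform(X)
    tsne_2d = TSNE(n_components=2).fit_transform(X)
    lle_2d  = LocallyLinearEmbedding(n_components=2, n_neighbors=10,
                                     method='hessian').fit_transform(X)  # Hessian-LLE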


t-SNE should be tried first, because there is code on the web for it (http://homepage.tudelft.nl/19j49/t-SNE.html) and, in my experience, it seems to dominate the other methods you mention.

(May 07 '11 at 02:19) gdahl ♦

There is code for the other methods in every machine learning library ;)

(May 11 '11 at 10:53) Andreas Mueller

The fastest, ugliest, and simplest approach I can think of for that kind of problem would be Self-Organizing Maps.

Self-Organizing Maps are designed to do exactly what you want: they project from a k-dimensional space onto a 2D map while preserving the topology of your data, so nearby vectors are still grouped together.

It is rather easy to implement and there are a lot of good libraries written for it.

answered May 04 '11 at 07:00

Leon Palafox ♦
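One such library is the third-party minisom package; a sketch (grid size and training length are arbitrary choices here):

    # Sketch: map k-dimensional points onto a 2D SOM grid with the
    # third-party `minisom` package (pip install minisom).
    import numpy as np
    from minisom import MiniSom

    X = np.random.rand(500, 50)                    # placeholder (N, k) data
    som = MiniSom(20, 20, X.shape[1], sigma=1.0, learning_rate=0.5)
    som.train_random(X, 1000)                      # 1000 random training iterations
    coords = np.array([som.winner(x) for x in X])  # (N, 2) grid coordinates to plot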

Adding to this, there was a paper from Hinton's group not too long ago along these lines. They basically used a combination of a DBN and PCA, as I recall. I wish I could say which paper it was, who the author is, or even when it was published ... it's kind of a blur with how many I've sifted through.

(May 13 '11 at 14:23) Brian Vandenberg

I'd try everything ogrisel suggested, and also possibly Isomap (when you expect your data to live on some manifold; but otherwise, how would you project it down anyway?). You can also try Kernel PCA or any other dimensionality reduction method, like (kernel) ICA or Factor Analysis.

About t-SNE: this is from Hinton's lab and I think there is some magic involved. Hinton talked about it at the last NIPS and used phrases like "... and then I let the probabilities sum to 4. The unnormalized probabilities summed to two. I tried to normalize it but then it didn't work as well, so I went in the other direction." There is a fairly detailed discussion of t-SNE in this thread.

answered May 04 '11 at 07:54

Andreas Mueller

edited May 11 '11 at 10:36
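For reference, a sketch of the additional methods mentioned above as they appear in scikit-learn (placeholder data; parameters are illustrative):

    # Sketch: Isomap, Kernel PCA, ICA and Factor Analysis via scikit-learn.
    import numpy as np
    from sklearn.manifold import Isomap
    from sklearn.decomposition import KernelPCA, FastICA, FactorAnalysis

    X = np.random.rand(300, 50)  # placeholder (N, k) data

    iso_2d  = Isomap(n_components=2).fit_transform(X)
    kpca_2d = KernelPCA(n_components=2, kernel='rbf').fit_transform(X)
    ica_2d  = FastICA(n_components=2).fit_transform(X)
    fa_2d   = FactorAnalysis(n_components=2).fit_transform(X)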


Hehehe I think he just says that because he knows it will annoy the folks at NIPS. Having probabilities that sum up to 4 in the early stages of the optimization is not at all required to get good results with t-SNE. Admittedly, there are quite a few tricks involved in getting into good local minima of the t-SNE cost function (none of which you have to bother about much when using the online implementations).

(May 04 '11 at 10:51) Laurens van der Maaten

Could be. I have never tried it myself; he just gave the impression that there is a lot of magic.

(May 11 '11 at 10:52) Andreas Mueller

This isn't exactly what you asked for, but depending on what you need you might try Parallel Coordinates. They can be useful for identifying regions in feature space where points clump together, especially if you can represent different groups (classes, clusters) with different colors. There are weaknesses, of course (for one, the ordering and scaling of the axes can affect how easy or difficult it is to draw conclusions), but I believe it is a pretty popular technique for visualizing high-dimensional data. EagerEyes has another good overview.

Hope this helps.

answered May 05 '11 at 23:10

Troy Raeder
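A sketch of the idea with pandas' built-in parallel_coordinates plot, coloring by a (placeholder) class column:

    # Sketch: parallel coordinates with pandas, one colored line per data point.
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from pandas.plotting import parallel_coordinates

    df = pd.DataFrame(np.random.rand(100, 5),
                      columns=['f1', 'f2', 'f3', 'f4', 'f5'])  # placeholder features
    df['label'] = np.random.choice(['a', 'b'], size=len(df))   # placeholder classes
    parallel_coordinates(df, 'label')
    plt.show()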

As an alternative to SOM, see Generative Topographic Maps (GTM).

answered May 11 '11 at 02:50

Lucian Sasu

What is wrong with Generative Topographic Maps? There is a free implementation in the Netlab toolbox for MATLAB.

(Jun 17 '11 at 09:29) Nikos G

Hi Guys,

Just to piggyback on this thread and ask a quick question: in that scenario, let's say we have 100 snippets (each with around 100 words); how do we determine the top k (say, 10) most important terms from that collection?

Regards, Andy.

answered May 12 '11 at 02:31

cherhan

Tf-idf (http://en.wikipedia.org/wiki/Tf%E2%80%93idf), maybe? Why don't you open a separate question for this?

(Jun 08 '11 at 16:47) Frank
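A sketch of that tf-idf route with scikit-learn, scoring each term by its summed tf-idf weight over the collection (the snippets and the scoring rule here are illustrative):

    # Sketch: pick the top-k terms by summed tf-idf weight across snippets.
    # Uses get_feature_names_out(), which requires scikit-learn >= 1.0.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    snippets = ["first text snippet ...", "second text snippet ..."]  # ~100 snippets
    vec = TfidfVectorizer()
    tfidf = vec.fit_transform(snippets)             # sparse (n_snippets, n_terms)
    scores = np.asarray(tfidf.sum(axis=0)).ravel()  # summed weight per term
    terms = vec.get_feature_names_out()
    print(terms[np.argsort(scores)[::-1][:10]])     # 10 highest-scoring terms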

Please post a new question. It helps everyone.

(Jun 09 '11 at 00:27) Robert Layton