I have done some research into dimensionality reduction techniques and have implemented a few of them (PCA, t-SNE, random projection, and Hinton's autoencoder). I have tried them out on the MNIST dataset with success.

However, I would like to use them for semantic indexing, unsupervised clustering of documents, and semantic analysis, and so far I am not satisfied with the results. I believe the problem is in how I set up the data matrix. The features for each document (or word) are the words that appear in the document (or context), but what score should I use for each feature: binary, term frequency, or tf-idf? Any suggestions? Thanks!
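To make the three candidate scorings concrete, here is a minimal sketch in plain Python. The function name `build_matrices`, the toy documents, and the smoothed idf formula are my own assumptions for illustration (several idf variants exist); it is not tied to any particular library.

```python
import math
from collections import Counter

def build_matrices(docs):
    """Return (vocab, binary, tf, tfidf) for a list of tokenized documents.

    Hypothetical helper: each matrix is a list of rows, one row per document,
    one column per vocabulary word.
    """
    vocab = sorted({w for doc in docs for w in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(w for doc in docs for w in set(doc))
    # Smoothed idf (an assumption; one common variant among several).
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1.0 for w in vocab}

    binary, tf, tfidf = [], [], []
    for doc in docs:
        counts = Counter(doc)
        binary.append([1 if counts[w] else 0 for w in vocab])  # presence/absence
        tf.append([counts[w] for w in vocab])                  # raw counts
        tfidf.append([counts[w] * idf[w] for w in vocab])      # counts weighted by rarity
    return vocab, binary, tf, tfidf

# Toy corpus, invented for the example.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]
vocab, binary, tf, tfidf = build_matrices(docs)
```

Each of the three matrices can then be fed to the same reduction method, so the scoring becomes the only thing that varies between runs.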

asked Nov 01 '10 at 14:13

Oliver Mitevski

One Answer:

I would suggest trying all of them and ranking them by how well they preserve the pairwise distances between documents, as measured by the Pearson correlation between the distances before and after the reduction. Use cross-validation for this, treating the feature representation as a hyper-parameter to optimize with the iterated CV / grid-search procedure.
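As a rough sketch of the score itself (not of the cross-validation loop around it), the following plain-Python code computes the Pearson correlation between pairwise Euclidean distances in the input space and in the embedding. The function names, the choice of Euclidean distance, and the toy points are assumptions made for the example.

```python
import math
from itertools import combinations

def pairwise_distances(X):
    """Euclidean distances for all unordered pairs of rows, in a fixed order."""
    return [math.dist(a, b) for a, b in combinations(X, 2)]

def pearson(x, y):
    """Pearson correlation of two equal-length sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def distance_preservation(X_high, X_low):
    """Correlation between pairwise distances before and after reduction."""
    return pearson(pairwise_distances(X_high), pairwise_distances(X_low))

# Toy data: five 3-D points, "reduced" by simply dropping the last coordinate.
X_high = [[0, 0, 0], [1, 0, 0.5], [0, 2, 1], [3, 1, 0.2], [2, 2, 2]]
X_low = [row[:2] for row in X_high]
score = distance_preservation(X_high, X_low)
```

A score near 1 means the embedding keeps distances in roughly the same proportions; the same function can be called once per feature representation to rank them.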

answered Nov 01 '10 at 14:36

ogrisel

You kind of answered another question I wanted to ask, which is how to validate these dimensionality reduction techniques. For the MNIST dataset, I validated them only by visualizing the 2D embeddings with the corresponding labels. What other ways are used to validate dimensionality reduction algorithms? Thanks a lot!

(Nov 01 '10 at 15:17) Oliver Mitevski

You can also perform kNN queries on random samples and evaluate the precision and recall at rank R, with R = 5, 10, 50, ... That is, you consider only the R closest samples to your reference point in both the input space and the embedding space, and measure the overlap by computing precision and recall on these restricted queries.

This is probably a better measure than the Pearson correlation of pairwise distances for t-SNE, as t-SNE does not try to preserve the ratios of large distances.
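A minimal sketch of that kNN overlap check, in plain Python with brute-force neighbour search. The function names, the Euclidean metric, and the optional `n_relevant` parameter are assumptions for illustration; when the same R is used in both spaces, precision and recall come out equal.

```python
import math
import random

def knn(X, query, r):
    """Indices of the r rows of X closest to row `query`, excluding itself."""
    others = [i for i in range(len(X)) if i != query]
    others.sort(key=lambda i: math.dist(X[query], X[i]))
    return set(others[:r])

def precision_recall_at_r(X_high, X_low, r, n_relevant=None, n_queries=100, seed=0):
    """Average precision and recall at rank r over randomly sampled queries.

    "Relevant" items are the n_relevant nearest neighbours in the input space
    (defaults to r); "retrieved" items are the r nearest in the embedding.
    """
    n_relevant = r if n_relevant is None else n_relevant
    rng = random.Random(seed)
    queries = rng.sample(range(len(X_high)), min(n_queries, len(X_high)))
    precisions, recalls = [], []
    for q in queries:
        relevant = knn(X_high, q, n_relevant)  # neighbours in the input space
        retrieved = knn(X_low, q, r)           # neighbours in the embedding
        hits = len(relevant & retrieved)
        precisions.append(hits / len(retrieved))
        recalls.append(hits / len(relevant))
    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)

# Toy data: random 5-D points, "embedded" by keeping the first two coordinates.
rng = random.Random(42)
X_high = [[rng.random() for _ in range(5)] for _ in range(30)]
X_low = [row[:2] for row in X_high]
precision, recall = precision_recall_at_r(X_high, X_low, r=5)
```

Because only the R nearest neighbours are compared, the score ignores how far apart distant points end up, which is why it suits methods such as t-SNE.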

(Nov 01 '10 at 15:27) ogrisel

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.