|
I am trying to do dimensionality reduction, either via feature selection or via projection into a different basis, given an input space with many high-dimensional instances. Something like PCA would probably do great, but I grow impatient waiting for the eigenvalue decomposition every time I want to run an experiment. Random projections are very fast, but the end results aren't very promising. Does anyone have suggestions for bringing the dimensionality of high-dimensional data (text, for instance) way down while still maintaining as much information as possible?
|
If your data matrix is sparse (which I'd expect for a text corpus) and you don't want to do a full eigen-decomposition, you can try the Lanczos algorithm, an iterative method for finding the top K eigenvectors. Matlab implements it via the eigs/svds functions, which can give you the top K eigenvectors (or even the smallest ones if you need them); a SciPy sketch of the same idea is shown below. There are also a number of other packages, such as SVD/LSA, which you could use. Besides, if you want to try a probabilistic version, there is an expectation-maximization (EM) based latent semantic analysis, which can be reasonably fast; here is a Matlab implementation. Even PCA can be solved using EM; see the Wikipedia entry on PCA. Another option could be kernel PCA, which has a two-fold benefit: (1) it can learn nonlinear projections, and (2) it can be very efficient when the number of features D far exceeds the number of examples N, because it is based on the eigen-decomposition of the N x N kernel matrix rather than the D x D covariance matrix.

Thanks, I should have mentioned that I did implement the Lanczos algorithm (in Java, though) with some optimizations for sparse data. I could probably make it a bit faster for sparse data, but asymptotically it would be the same, and I doubt it would make much of a difference. Kernel PCA sounds like an option, though I'm not really sure I understand how it works. Even if I'm willing to treat it as a black box, which I am, I don't know if I could implement it quickly. I'll try to hunt down a resource here. Probabilistic PCA is one possibility; I'll try to code this up and see if it's any faster. What about LDA, or probabilistic clustering techniques?
(Oct 10 '10 at 10:22)
downer
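As a concrete illustration of the truncated-SVD route suggested above, here is a minimal Python sketch using scipy.sparse.linalg.svds (a Lanczos/ARPACK-style solver standing in for Matlab's svds); the toy sparse matrix is just a placeholder for a real document-term matrix:

```python
# Minimal sketch: top-K truncated SVD of a sparse document-term matrix,
# using SciPy's svds as a stand-in for Matlab's svds.
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

X = sparse_random(10000, 50000, density=0.001, format="csr", random_state=0)  # toy "corpus"

k = 50                                  # number of components to keep
U, s, Vt = svds(X, k=k)                 # returns the k largest singular triplets
order = np.argsort(s)[::-1]             # svds returns singular values in ascending order
U, s, Vt = U[:, order], s[order], Vt[order, :]

X_reduced = U * s                       # documents projected onto the top-k latent space
```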
LDA is far slower than SVD.
(Oct 10 '10 at 11:46)
Joseph Turian ♦♦
@Joseph Turian: except that you can get quite a good approximation to the LDA problem in two passes over the data with collapsed Gibbs sampling, while even state-of-the-art methods for SVD tend to need a bit more.
(Oct 10 '10 at 13:26)
Alexandre Passos ♦
The random projection techniques only take a few passes over the data (total passes, not iterations). For most uses in machine learning, these methods are much faster than LDA, though perhaps not as appropriate in all cases. It is definitely true that you need quite a few more passes over the data if you use Krylov subspace methods such as the Lanczos algorithm, but those algorithms are definitely not the state of the art for large-scale parallel decompositions.
(Oct 11 '10 at 00:08)
Ted Dunning
There is a Matlab implementation of probabilistic PCA in this toolbox. The main computational cost at each EM iteration is the inversion of a covariance matrix; this inversion is cubic in the number of dimensions into which you're embedding, so the fewer dimensions you want to preserve, the faster it gets (see the sketch below). Instead of the Lanczos algorithm, you could also look into Jacobi-Davidson eigensolvers. In my experience, these tend to be faster than Lanczos on large problems.
(Oct 12 '10 at 13:48)
Laurens van der Maaten
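For reference, a bare-bones NumPy sketch of the EM route (plain EM for PCA in the style of Roweis, without the noise term of full probabilistic PCA); the only inversions involve small k x k matrices, which is why keeping fewer dimensions makes each iteration faster:

```python
# Bare-bones EM for PCA (Roweis-style, no noise term): the only matrix
# inversions are k x k, so the per-iteration cost scales with the target
# dimensionality k rather than the original dimensionality D.
import numpy as np

def em_pca(Y, k, n_iter=100, seed=0):
    """Y: D x N centered data matrix; returns a D x k loading matrix W."""
    D, N = Y.shape
    rng = np.random.RandomState(seed)
    W = rng.randn(D, k)
    for _ in range(n_iter):
        # E-step: latent coordinates X (k x N), given the current loadings W
        X = np.linalg.solve(W.T @ W, W.T @ Y)
        # M-step: update the loadings W, given the latent coordinates
        W = Y @ X.T @ np.linalg.inv(X @ X.T)
    return W
```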
|
|
Ted Dunning has told me that his technique for fast dimensionality reduction is a random projection followed by SVD. YMMV.
This is essentially the algorithm described in last year's NIPS tutorial by Gunnar Martinsson. From what I recall, you should be able to get a very good approximation to the k-truncated SVD using something like k+5 dimensions for the random projection and truncating the SVD of that.
(Oct 10 '10 at 13:26)
Alexandre Passos ♦
This is exactly the method that I was referring to. See http://arxiv.org/abs/0909.4061 for Martinsson et al.'s most excellent paper on the subject. Depending on the structure of the singular values, you may need more than k+5 dimensions, but the method is diabolically fast. It typically takes me 10x longer to read a sparse matrix in R than to compute the partial SVD.
(Oct 11 '10 at 00:05)
Ted Dunning
You can see the NIPS tutorial by Gunnar Martinsson at http://videolectures.net/nips09_martinsson_mvll/
(Oct 28 '10 at 19:09)
Stelios Sfakianakis
|
|
Random projections alone are not the same as a random-projection-based SVD method. See the arXiv link I posted just a bit ago.

Also, keep in mind that once the singular values have decreased even modestly, you are only going to get very fancy random numbers out of any SVD algorithm. Real data tends to have long-tail x long-tail structure, which means that the singular values are likely to decay very quickly. This often means that only a few eigenvalues (at most 10-20) are large enough to justify extracting from the decomposition of a noisy matrix. There have been a number of experiments with, say, LSI that showed a much larger dimension for the reduced space gave better results. My experience is that the reduced dimensionality should, indeed, be larger than 10 for most problems, but that replacing eigenvectors 11..300 with random numbers works just as well as 300 supposedly real eigenvectors.

The number of latent factors required varies widely with the problem. The LSI team routinely used 300. HNC Software also used 300 with Convectis. At the other end, Menon and Elkan got very good results on recommendation problems with 5 latent factors in their latent factor log-linear (LFL) model.
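A quick way to check that decay on your own data is to look at the leading singular values directly; a small sketch assuming SciPy and a sparse matrix X (the 5% cutoff is an arbitrary placeholder):

```python
# Quick check of singular value decay to decide how many factors are worth keeping.
import numpy as np
from scipy.sparse.linalg import svds

def leading_singular_values(X, k=50):
    """Return the top-k singular values of sparse X in descending order."""
    _, s, _ = svds(X, k=k)
    return np.sort(s)[::-1]

# Example: keep only components whose singular value is at least 5% of the largest.
# s = leading_singular_values(X)
# n_keep = int(np.sum(s >= 0.05 * s[0]))
```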
|
It depends somewhat on which approaches you're prepared to consider for working with the data afterwards. If you're not averse to offsetting some of the time gains from random-projection preprocessing with, say, boosting an ensemble of weak learners to reduce the impact of the information loss and variability in the randomly projected data, then you can still probably expect good results (especially since your text data are likely to be reasonably sparse, so compressed-sensing results imply that most of the information in the full-dimensional data is retained after RP).

In (Bingham, 2001) the authors got good results from RP on text data. In (Fradkin and Madigan, 2003) RP was found to be competitive, on average, with PCA for a variety of classifiers on a range of standard datasets. More recently, we presented some theory at KDD 2010 (Durrant and Kaban, 2010) that gives guarantees on classification error for RP combined with Fisher's Linear Discriminant (on average, over the random picks of projection matrix) when the projection dimensionality is only logarithmic in the number of classes. We found that the classification error increases nearly exponentially as the projected dimension gets smaller, and (unsurprisingly) the variability of classifier performance also increases as the projected dimension drops. Nonetheless, one can still expect better-than-chance performance from the projected classifier in this setting, so an ensemble approach can recover the lost performance, and the space-complexity savings then more or less come for free.
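A rough scikit-learn sketch of that ensemble idea (independent sparse random projections, a linear/Fisher discriminant fit on each, class scores averaged); the member and component counts are arbitrary placeholders:

```python
# Sketch: ensemble of linear discriminants, each trained on its own random projection.
# Averaging the members' class probabilities recovers much of the accuracy lost
# by any single low-dimensional projection.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.random_projection import SparseRandomProjection

def rp_fld_ensemble_predict(X_train, y_train, X_test,
                            n_members=10, n_components=50, seed=0):
    rng = np.random.RandomState(seed)
    score_sum, classes = None, None
    for _ in range(n_members):
        rp = SparseRandomProjection(n_components=n_components, dense_output=True,
                                    random_state=rng.randint(2**31 - 1))
        clf = LinearDiscriminantAnalysis()
        clf.fit(rp.fit_transform(X_train), y_train)
        proba = clf.predict_proba(rp.transform(X_test))
        score_sum = proba if score_sum is None else score_sum + proba
        classes = clf.classes_
    return classes[np.argmax(score_sum, axis=1)]
```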
|
The random projection -> PCA approach is way faster than PCA alone. I'd like to try kernel PCA next; I'm still not sure how it works yet. In the end, I'd like to put this system into production, assuming the results pan out. I'm trying to investigate the relationship between dimensionality and the quality of predictions of a black-box predictive model, as a function of the number of labeled instances. I would expect that a lower-dimensional representation has greater sample density and fewer parameters to tune, so given only a few labeled examples, I suspect it would be advantageous to project down and avoid variance-based errors. On the other hand, more data can overcome these errors for higher-dimensional representations. Is there a mistake in my thinking?
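For what it's worth, a minimal scikit-learn sketch of that random-projection-then-PCA pipeline (the intermediate and final dimensionalities are arbitrary placeholders):

```python
# Two-stage reduction: a cheap sparse random projection first, then exact PCA
# on the (now much smaller) intermediate representation.
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.random_projection import SparseRandomProjection

rp_then_pca = make_pipeline(
    SparseRandomProjection(n_components=500, dense_output=True, random_state=0),
    PCA(n_components=50),
)
# X is an n_documents x n_features matrix (sparse input is fine for the projection step).
# X_reduced = rp_then_pca.fit_transform(X)
```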
|
Btw, here is some R code that implements a random-projection SVD. The point is that this is really simple to do, and for a sparse matrix A it is really, really fast. This approach is much simpler to get right than Lanczos and much easier to parallelize as well.
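The same recipe in rough NumPy/SciPy form (an illustrative sketch, not the original R listing): project onto a few random directions with a little oversampling, orthogonalize, then take an exact SVD of the small projected matrix.

```python
# Sketch of a basic randomized SVD (Halko/Martinsson-style): random projection,
# QR to get an orthonormal range basis, then an exact SVD of the small matrix.
# Works directly on SciPy sparse matrices.
import numpy as np

def randomized_svd(A, k, oversample=5, seed=0):
    """Approximate rank-k SVD of A (n x d, dense or scipy.sparse)."""
    rng = np.random.RandomState(seed)
    omega = rng.randn(A.shape[1], k + oversample)   # random test matrix
    Y = A @ omega                                   # sample the range of A
    Q, _ = np.linalg.qr(Y)                          # orthonormal basis for that range
    B = (A.T @ Q).T                                 # small (k + oversample) x d matrix
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ U_small)[:, :k], s[:k], Vt[:k, :]
```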
|
|
Specifically for text, there is a much-cited work by Yang and Pedersen, "A comparative study on feature selection in text categorization" (ICML 1997). Maybe you'll find one of the methods there useful.
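One of the simpler criteria surveyed in that paper is the chi-squared statistic between each term and the class labels; a scikit-learn sketch (the number of retained features is an arbitrary placeholder):

```python
# Supervised feature selection for text: keep the terms with the highest
# chi-squared statistic with respect to the class labels.
from sklearn.feature_selection import SelectKBest, chi2

selector = SelectKBest(chi2, k=1000)
# X: sparse n_documents x n_terms count matrix, y: class labels
# X_selected = selector.fit_transform(X, y)
```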
|
Also, if your data has some natural topology (like an image), you can build a very sparse dissimilarity matrix between features. Once you have this, you can apply Ward clustering to it to reduce the size of the data somewhat by feature agglomeration. My experience is that this works really well, as long as you don't agglomerate too many features: you are agglomerating on noise as much as on the data, and as you go higher in the Ward tree, the resulting clusters become more and more sensitive to the noise. We tend to use this technique in addition to other, more robust methods, such as ones based on minimizing a convex function rather than on a greedy solution (see our paper for an application to feature selection in brain imaging).
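A small scikit-learn sketch of this kind of connectivity-constrained Ward feature agglomeration, assuming image-shaped features (the grid size and cluster count are placeholders):

```python
# Feature agglomeration with Ward linkage, constrained by the image grid topology
# so that only spatially adjacent features can be merged.
from sklearn.cluster import FeatureAgglomeration
from sklearn.feature_extraction.image import grid_to_graph

connectivity = grid_to_graph(n_x=64, n_y=64)          # adjacency of a 64x64 pixel grid
agglo = FeatureAgglomeration(n_clusters=500, linkage="ward",
                             connectivity=connectivity)
# X: n_samples x 4096 matrix of flattened 64x64 images
# X_reduced = agglo.fit_transform(X)
```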
|