Let's say I have a lowdimensional dense feature representation. I would like to transform this dense, lowdimensional representation into a sparse, highdimensional representation, which can be better input for a linear classifier. How do I sparsify a dense feature representation? Here's a concrete example of how this could me used: Let's say I have 1M images, each of them 32x32 pixels. Instead of representing each image as a dense vector with 1024 entries, I would maybe want a 10K vector with only 10 nonzeroes. asked Jun 15 '10 at 11:24 Joseph Turian ♦♦ 
Sounds like you might have good input for training SVMs. Might be worth thinking about what kind of kernel would best allow an SVM to capture the structure inherent in your dataset. answered Jun 15 '10 at 15:57 Jey Kottalam Is your idea that I train an SVM, and somehow use the support vectors as my sparse feature representation? Could you elaborate? Also, how can I convert a dense feature vector into a sparse one using SVMs if I don't have any labels for my training data?
(Jun 15 '10 at 15:59)
Joseph Turian ♦♦
Oops, I was under the wrong impression that you had labels since you want to train linear classifiers. That's an interesting question. I don't know of any ready made solution to such a problem, it sounds to me like it would really depend on what sort of structure you wanted to capture, e.g. SIFT for object recognition in images.
(Jun 15 '10 at 17:45)
Jey Kottalam

I would just learn a sparse code for your data, using one of the dictionary learning sparse coding techniques (most derived from the basic work of Olshausen & Field 1996). A cheaper option would be to learn a sparse autoencoder (e.g. with a criterion that promotes sparsity of the representation), or something in between sparse coding and autoencoders (and also in between in terms of computational cost), the predictive sparse decomposition algorithm from Yann LeCun's lab at NYU. answered Jun 15 '10 at 18:02 Yoshua Bengio 2
The lecture notes by Andrew Ng give details on implementing such a component wise sparse activation penalty: http://www.stanford.edu/class/cs294a/sparseAutoencoder.pdf I think this is somewhat related to some of the papers by LeCun and Ranzato. On the dictionary learning side (used for sparse coding), the work by Julien Mairal features a scalable online learning algorithm combining alternated LARS for sparse coding and an online dictionary learner that updates it's parameters based on the latest version of the sparse codes: http://jmlr.csail.mit.edu/papers/v11/mairal10a.html
(Jun 23 '10 at 16:18)
ogrisel
Does anyone have a copy of this lecture (sparseAutoencoder.pdf)? The link seems to be broken, and I couldn't find it anywhere [Google Search, Stanford, Internet Archive].
(Sep 29 '10 at 19:59)
Dmitry Chichkov
1
You can find it here: http://www.stanford.edu/class/archive/cs/cs294a/cs294a.1104/sparseAutoencoder.pdf HTH
(Dec 06 '10 at 08:39)
npinto

You could run kmeans (or some other centroidbased clustering algorithm) with a huge number of means, and make your new representation some measure of distance to some subset of the means (such as the ten nearest neighbours).
This answer is marked "community wiki".
answered Jun 16 '10 at 00:01 David Warde Farley ♦ 1
Yes, I believe Doug Eck mentioned that this is the method used by Gal Chechik et al (JMLR 2010), "Large scale online learning of image similarity through ranking".
(Jun 17 '10 at 15:03)
Joseph Turian ♦♦
Thanks, I couldn't remember who he said the reference was from.
(Jun 17 '10 at 15:13)
David Warde Farley ♦
Yes, we did exactly that, with 1 nearest neighbour. We then applied normalized tfidf to control the very unequal distribution one gets.
(Jul 01 '10 at 16:12)
ualex

One option would be to use the Indian buffet process (IBP) to learn a sparse binary representation of each dense vector. Though in practice this wouldn't represent a lowdimensional dense vector to highdimensional sparse vector since IBP tries to find a low(er) dimensional sparse representation of your dense vector (akin to factor analysis/dimensionality reduction), but the nice thing with IBP is that you don't need to know (i.e., set a priori) how long your sparse vectors should be (basically, similar to how many basis vectors you should use while doing PCA). However, if you really want the sparse vectors to be higher dimensional than the original dense ones, then you can do sparse coding (as Yoshua suggests) with an overcomplete basis. answered Sep 30 '10 at 03:54 spinxl39 
Another alternative is to extract the top k singular vector of your data and use them as a dictionary for a sparse encoder that will find a sparse projection on this singular space by using a l1penalized method that tries to minimize the squared reconstruction error. You can do this with either LARS or a coordinate descent implementation of the Lasso for instance. answered Dec 06 '10 at 09:31 ogrisel 
Adam Coates et al. (2010) show how soft kmeans can be better than other options (sparse autoencoders, or deep nets). "An Analysis of SingleLayer Networks in Unsupervised Feature Learning" answered Dec 22 '10 at 11:01 Yariv Maron Yes this is indeed very relevant and a very promising approach: I started some toy implementation in a scikitlearn branch to reproduce those results here: https://github.com/ogrisel/scikitlearn/commits/imagepatches This is pretty much work in progress but early feed back is always appreciated.
(Dec 22 '10 at 11:05)
ogrisel
I have been playing around with the two main (new) ideas (kmeans in whitened space, kmeanstriangle) in the paper the last days. In general I had a collection of about 150.000 image patches, preprocessed with mean/std. normalization and final whitening.  With 8x8 patches, I generally get centers with kmeans, that have the nice local structure as depicted in the paper (in the unwhitened space), this is independent of the number of centers (tried up to 400).  With 16x16 patches, this no longer works so nicely. With 200 centers, half (or 2/3) of the centers have nice local features, the other half (or 1/3) clearly resemble some images from the collection.  Sparseness (kmeanstriangle): I'm computing the f_k(x)'s from the paper in the whitened space. Overall, the new representation for an image patch (8x8) is not so sparse (this is true for 20, 50, 100 or 200 clusters), only between 5060% of the entries are 0. The new represenration is computed on the 150.000 images that were used for finding the centers. I think that the 'problem' with the 16x16 patches could be resolved using more image patches, I'll try 500.000 this evening, if things go well. Any suggestions regarding sparseness? I'm looking for at least 90% of the entries having zeros. Any ideas, suggestions (pointing out possible mistakes on my side) are very welcome :).
(Dec 23 '10 at 08:53)
osdf
osdf if you continued your work on this kmeans work, I would be happy to hear more about it, perhaps post here? https://sites.google.com/site/kmeanslearning
(Jan 04 '11 at 22:24)
karpathy

It would seem to me that this is close to dictionary learning. You might want to take a lookat the following algorithms: http://sites.google.com/site/igorcarron2/cs#sparse In compressed sensing, we need dictionaries to eventually recover the original sparse signal.
This answer is marked "community wiki".
answered Jan 05 '11 at 12:20 Igor Carron 