Let's say I have a low-dimensional dense feature representation. I would like to transform this dense, low-dimensional representation into a sparse, high-dimensional representation, which can be better input for a linear classifier.

How do I sparsify a dense feature representation?

Here's a concrete example of how this could me used: Let's say I have 1M images, each of them 32x32 pixels. Instead of representing each image as a dense vector with 1024 entries, I would maybe want a 10K vector with only 10 non-zeroes.

asked Jun 15 '10 at 11:24

Joseph%20Turian's gravatar image

Joseph Turian ♦♦

edited Jun 15 '10 at 16:00

7 Answers:

Sounds like you might have good input for training SVMs. Might be worth thinking about what kind of kernel would best allow an SVM to capture the structure inherent in your dataset.

answered Jun 15 '10 at 15:57

Jey%20Kottalam's gravatar image

Jey Kottalam

Is your idea that I train an SVM, and somehow use the support vectors as my sparse feature representation? Could you elaborate?

Also, how can I convert a dense feature vector into a sparse one using SVMs if I don't have any labels for my training data?

(Jun 15 '10 at 15:59) Joseph Turian ♦♦

Oops, I was under the wrong impression that you had labels since you want to train linear classifiers. That's an interesting question. I don't know of any ready made solution to such a problem, it sounds to me like it would really depend on what sort of structure you wanted to capture, e.g. SIFT for object recognition in images.

(Jun 15 '10 at 17:45) Jey Kottalam

I would just learn a sparse code for your data, using one of the dictionary learning sparse coding techniques (most derived from the basic work of Olshausen & Field 1996).

A cheaper option would be to learn a sparse auto-encoder (e.g. with a criterion that promotes sparsity of the representation), or something in between sparse coding and auto-encoders (and also in between in terms of computational cost), the predictive sparse decomposition algorithm from Yann LeCun's lab at NYU.

answered Jun 15 '10 at 18:02

Yoshua%20Bengio's gravatar image

Yoshua Bengio


The lecture notes by Andrew Ng give details on implementing such a component wise sparse activation penalty:


I think this is somewhat related to some of the papers by LeCun and Ranzato.

On the dictionary learning side (used for sparse coding), the work by Julien Mairal features a scalable online learning algorithm combining alternated LARS for sparse coding and an online dictionary learner that updates it's parameters based on the latest version of the sparse codes:


(Jun 23 '10 at 16:18) ogrisel

Does anyone have a copy of this lecture (sparseAutoencoder.pdf)? The link seems to be broken, and I couldn't find it anywhere [Google Search, Stanford, Internet Archive].

(Sep 29 '10 at 19:59) Dmitry Chichkov

You can find it here: http://www.stanford.edu/class/archive/cs/cs294a/cs294a.1104/sparseAutoencoder.pdf


(Dec 06 '10 at 08:39) npinto

You could run k-means (or some other centroid-based clustering algorithm) with a huge number of means, and make your new representation some measure of distance to some subset of the means (such as the ten nearest neighbours).

This answer is marked "community wiki".

answered Jun 16 '10 at 00:01

David%20Warde%20Farley's gravatar image

David Warde Farley ♦


Yes, I believe Doug Eck mentioned that this is the method used by Gal Chechik et al (JMLR 2010), "Large scale online learning of image similarity through ranking".

(Jun 17 '10 at 15:03) Joseph Turian ♦♦

Thanks, I couldn't remember who he said the reference was from.

(Jun 17 '10 at 15:13) David Warde Farley ♦

Yes, we did exactly that, with 1 nearest neighbour. We then applied normalized tf-idf to control the very unequal distribution one gets.

(Jul 01 '10 at 16:12) ualex

One option would be to use the Indian buffet process (IBP) to learn a sparse binary representation of each dense vector. Though in practice this wouldn't represent a low-dimensional dense vector to high-dimensional sparse vector since IBP tries to find a low(er) dimensional sparse representation of your dense vector (akin to factor analysis/dimensionality reduction), but the nice thing with IBP is that you don't need to know (i.e., set a priori) how long your sparse vectors should be (basically, similar to how many basis vectors you should use while doing PCA).

However, if you really want the sparse vectors to be higher dimensional than the original dense ones, then you can do sparse coding (as Yoshua suggests) with an overcomplete basis.

answered Sep 30 '10 at 03:54

spinxl39's gravatar image


Another alternative is to extract the top k singular vector of your data and use them as a dictionary for a sparse encoder that will find a sparse projection on this singular space by using a l1-penalized method that tries to minimize the squared reconstruction error. You can do this with either LARS or a coordinate descent implementation of the Lasso for instance.

answered Dec 06 '10 at 09:31

ogrisel's gravatar image


Adam Coates et al. (2010) show how soft k-means can be better than other options (sparse autoencoders, or deep nets).

"An Analysis of Single-Layer Networks in Unsupervised Feature Learning"

answered Dec 22 '10 at 11:01

Yariv%20Maron's gravatar image

Yariv Maron

Yes this is indeed very relevant and a very promising approach: I started some toy implementation in a scikit-learn branch to reproduce those results here:


This is pretty much work in progress but early feed back is always appreciated.

(Dec 22 '10 at 11:05) ogrisel

I have been playing around with the two main (new) ideas (kmeans in whitened space, kmeans-triangle) in the paper the last days. In general I had a collection of about 150.000 image patches, preprocessed with mean/std. normalization and final whitening. - With 8x8 patches, I generally get centers with kmeans, that have the nice local structure as depicted in the paper (in the unwhitened space), this is independent of the number of centers (tried up to 400). - With 16x16 patches, this no longer works so nicely. With 200 centers, half (or 2/3) of the centers have nice local features, the other half (or 1/3) clearly resemble some images from the collection. - Sparseness (kmeans-triangle): I'm computing the f_k(x)'s from the paper in the whitened space. Overall, the new representation for an image patch (8x8) is not so sparse (this is true for 20, 50, 100 or 200 clusters), only between 50-60% of the entries are 0. The new represenration is computed on the 150.000 images that were used for finding the centers.

I think that the 'problem' with the 16x16 patches could be resolved using more image patches, I'll try 500.000 this evening, if things go well. Any suggestions regarding sparseness? I'm looking for at least 90% of the entries having zeros. Any ideas, suggestions (pointing out possible mistakes on my side) are very welcome :).

(Dec 23 '10 at 08:53) osdf

osdf if you continued your work on this k-means work, I would be happy to hear more about it, perhaps post here? https://sites.google.com/site/kmeanslearning

(Jan 04 '11 at 22:24) karpathy

It would seem to me that this is close to dictionary learning. You might want to take a lookat the following algorithms:


In compressed sensing, we need dictionaries to eventually recover the original sparse signal.

This answer is marked "community wiki".

answered Jan 05 '11 at 12:20

Igor%20Carron's gravatar image

Igor Carron

Your answer
toggle preview

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.