
During my presentation "Machine Learning Empowered by Python", I created a real-time digit recognition demo. To make the system more robust, I had to manually extract features from the images and train my system on them.

Is there any algorithm to discover the best set of features from raw data and class information?

asked May 19 '10 at 21:51

Francis Piéraut

edited May 26 '10 at 16:50

Joseph Turian ♦♦


What features did you manually define?

(May 26 '10 at 16:50) Joseph Turian ♦♦

7 Answers:

Deep learning architectures such as stacked Restricted Boltzmann Machines or stacked autoencoders leverage an unsupervised pre-training phase that can be understood as data-driven feature extraction. If you further know that your samples are 2D pictures, blending the aforementioned models with convolutional layers can further reduce the number of parameters to fit and bring some shift invariance.
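
A minimal sketch of what one such pre-training layer could look like, assuming plain numpy and a single tied-weight sigmoid autoencoder trained on squared reconstruction error; the function name and all hyperparameters are illustrative, not from any specific library:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_autoencoder(X, n_hidden=64, lr=0.1, epochs=10, seed=0):
        """X: (n_samples, n_inputs) array of pixel intensities scaled to [0, 1]."""
        rng = np.random.RandomState(seed)
        n_inputs = X.shape[1]
        W = rng.normal(scale=0.01, size=(n_inputs, n_hidden))  # tied weights
        b_h = np.zeros(n_hidden)    # hidden bias
        b_v = np.zeros(n_inputs)    # visible (reconstruction) bias
        for _ in range(epochs):
            for x in X:
                h = sigmoid(x @ W + b_h)       # encode
                r = sigmoid(h @ W.T + b_v)     # decode
                d_r = (r - x) * r * (1 - r)    # gradient at the reconstruction
                d_h = (d_r @ W) * h * (1 - h)  # backpropagated to the hidden layer
                W -= lr * (np.outer(x, d_h) + np.outer(d_r, h))
                b_h -= lr * d_h
                b_v -= lr * d_r
        return W, b_h

    # The learned features are the hidden activations sigmoid(X @ W + b_h);
    # stacking several such layers yields the deep architectures described above.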

answered Jun 23 '10 at 17:44

ogrisel

edited Jun 23 '10 at 18:28

Taking the question "Is there any algorithm to discover the BEST set of features from raw data and class information?" literally, the answer is NO. Any algorithm for feature extraction, supervised or unsupervised, is extracting features under some set of assumptions, which may or may not be optimal for your problem. More to the point, if you have deep knowledge of a particular domain, you will often do better at defining features manually than any algorithm can do automatically.

answered Jun 30 '10 at 09:20

Dave Lewis

A good way to discover useful features is to train the model on a subset of them, then calculate the error or measure how well the model fits. You can then try adding one feature at a time and check whether it decreases the error significantly; if it does, it is a good discriminative feature.
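
A hedged sketch of that greedy loop in Python, assuming scikit-learn-style estimators; the classifier, the min_gain threshold, and the function name are illustrative choices:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def forward_select(X, y, min_gain=0.005):
        """Greedily add one feature (column) at a time while it keeps helping."""
        selected, remaining = [], list(range(X.shape[1]))
        best_score = 0.0
        while remaining:
            # Score every candidate feature added to the current subset
            scores = [(cross_val_score(LogisticRegression(max_iter=1000),
                                       X[:, selected + [j]], y, cv=5).mean(), j)
                      for j in remaining]
            score, j = max(scores)
            if score - best_score < min_gain:  # no feature improves the fit enough
                break
            best_score = score
            selected.append(j)
            remaining.remove(j)
        return selected, best_score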

answered Jun 30 '10 at 15:56

Aman

This is a type of greedy matching pursuit for classification/regression with L1 regularization (if you have a fixed threshold for adding a given feature, you can see that this algorithm is doing coordinate ascent on feature space with L1 regularization, as in http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.149.5344&rep=rep1&type=pdf). It can be good for feature selection, but not so much for feature discovery.

(Jun 30 '10 at 19:49) Alexandre Passos ♦

Agreed. I am still looking for an answer to the question posted.

(Jul 01 '10 at 01:06) Aman

Some form of Principal Component Analysis can work well for selecting a small set of features from a larger set in a supervised learning setting. For example, if you have 100 potential variables and you only want to use 40 of them, you can use some form of PCA to determine which of the variables have the most impact on your labeled results.

There is a good PCA package for Python built on numpy: http://folk.uio.no/henninri/pca_module/ I recommend using the SVD method.
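
For raw numpy, a minimal SVD-based PCA in the spirit of that recommendation might look like this; the function and variable names are illustrative:

    import numpy as np

    def pca_svd(X, n_components):
        """Project X (n_samples, n_features) onto its top principal components."""
        X_centered = X - X.mean(axis=0)          # PCA assumes centered data
        U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
        components = Vt[:n_components]           # principal directions
        explained_var = (S ** 2) / (len(X) - 1)  # variance along each direction
        return X_centered @ components.T, components, explained_var[:n_components]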

answered Jun 30 '10 at 17:59

Joel H


If you know that there will be high multicollinearity in your feature set (e.g. overlapping blocks of HOGs), partial least squares is probably a better option. Specifically, whereas PCA solves

max_{|r|=1} var(Xr),

given a set of features X and labels Y, partial least squares solves

max_{|r|=|s|=1} [cov(Xr, Ys)]^2,

i.e. it finds directions that explain covariance between X and Y, instead of variance in X alone. It was put to good use in this state-of-the-art human detection paper: http://www.umiacs.umd.edu/~lsd/papers/PLS-ICCV09.pdf. The paper's detection error tradeoff is the best so far.
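
As a sketch, scikit-learn's PLS implementation can extract such label-aware components; the toy data and the choice of n_components below are purely illustrative:

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    rng = np.random.RandomState(0)
    X = rng.randn(200, 100)                       # stand-in for a collinear feature set
    y = (X[:, :5].sum(axis=1) > 0).astype(float)  # toy labels

    pls = PLSRegression(n_components=20)  # directions maximizing cov(Xr, Ys)
    pls.fit(X, y.reshape(-1, 1))
    X_reduced = pls.transform(X)          # use these scores as features downstream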

(Jun 30 '10 at 19:28) Tudor Achim

PCA only performs a change of basis. From the moment you start discarding some of the principal axes (the ones with smaller eigenvalues), you have to know what you are doing, because they may actually be more discriminative than the ones with higher eigenvalues (the ones that capture most of the variance in the data). In other words, PCA can reduce the dimensionality of your data, but not necessarily in a way that helps perform a discriminative task afterwards.

(Jul 06 '10 at 05:54) Hugo Penedones

Already upvoted Tudor, but I just want to underline that PLS is a really powerful method. Check out locally weighted projection regression, for example. Also, with regard to Hugo's comment, you may want to look at minor component analysis and (Bayesian) eXtreme component analysis from Max Welling.

(Jul 06 '10 at 12:47) osdf

Non-negative matrix factorization is supposed to do just about that. You have to specify the number of features you're looking for beforehand, though. "Programming Collective Intelligence" by Toby Segaran has a pretty good chapter on it (I can't find much on the web; it's a relatively recent technique).
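
A minimal sketch of NMF via the classic Lee & Seung multiplicative updates, similar in spirit to the version in that book; note that the number of features k must be fixed up front, as mentioned above:

    import numpy as np

    def nmf(V, k, iters=200, seed=0, eps=1e-9):
        """Factor non-negative V (n_samples, n_dims) into W @ H with W, H >= 0."""
        rng = np.random.RandomState(seed)
        n, m = V.shape
        W = rng.rand(n, k)   # per-sample weights over the k features
        H = rng.rand(k, m)   # the k discovered features
        for _ in range(iters):
            H *= (W.T @ V) / (W.T @ W @ H + eps)   # multiplicative update for H
            W *= (V @ H.T) / (W @ H @ H.T + eps)   # multiplicative update for W
        return W, H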

answered Jul 05 '10 at 23:26

sbirch

The entropy of a feature might be a good way to assess its relevance to your classification task. A feature that occurs uniformly across classes has high entropy and does not really provide much information to help with classification. On the other hand, one that occurs in only one or just a few classes has lower entropy, i.e. it is more opinionated, so it might be a better training feature. Eventually you could start adding further features by figuring out which ones help you discriminate between the off-diagonal pairs of your confusion matrix that show a high misclassification rate.
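
A hedged numpy sketch of this heuristic for binary features: for each feature, compute the entropy of the class distribution among the samples where the feature fires (lower entropy = more opinionated). All names below are illustrative:

    import numpy as np

    def class_entropy_per_feature(X, y, eps=1e-12):
        """X: (n_samples, n_features) binary matrix; y: integer class labels."""
        classes = np.unique(y)
        entropies = np.empty(X.shape[1])
        for j in range(X.shape[1]):
            on = X[:, j] > 0                      # samples where feature j occurs
            counts = np.array([(y[on] == c).sum() for c in classes], dtype=float)
            p = counts / max(counts.sum(), 1.0)   # class distribution given the feature
            entropies[j] = -np.sum(p * np.log2(p + eps))
        return entropies  # sort ascending to rank the most class-specific features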

answered Jul 06 '10 at 02:18

Aditya Mukherji

edited Jul 06 '10 at 02:19

Honglak Lee, a recent graduate of Andrew Ng's group, wrote his dissertation on this topic, and I believe he has had great success tackling this problem in vision and speech. I highly recommend everyone check out his work. In particular, "Unsupervised feature learning for audio classification using convolutional deep belief networks" is recommended.

answered Jul 06 '10 at 14:40

aria42
