During my presentation "Machine Learning Empowered by Python", I created a real-time digit recognition demo. To make the system more robust, I had to manually extract features from the images and train my system. Is there any algorithm that can discover the best set of features from raw data and class information?
Deep learning architectures such as stacked Restricted Boltzmann Machines or stacked autoencoders leverage an unsupervised pre-training phase that can be understood as data-driven feature extraction. If you further know that your samples are 2D pictures, blending the aforementioned models with convolutional layers can further reduce the number of parameters to fit and bring some shift invariance.
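As a rough illustration, here is a minimal sketch of that idea, assuming scikit-learn and its bundled digits dataset: two Bernoulli RBM layers learn features from the raw pixels without ever seeing the labels, and only the final logistic regression uses the class information (layer sizes, learning rates and iteration counts are illustrative, not tuned).

    # Unsupervised feature learning with stacked RBMs feeding a classifier.
    from sklearn.datasets import load_digits
    from sklearn.neural_network import BernoulliRBM
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.model_selection import train_test_split

    X, y = load_digits(return_X_y=True)
    X = X / 16.0  # scale pixel intensities to [0, 1] for the Bernoulli RBMs
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # The RBM layers are trained without labels; only the final logistic
    # regression sees the class information.
    model = Pipeline([
        ("rbm1", BernoulliRBM(n_components=128, learning_rate=0.05, n_iter=20, random_state=0)),
        ("rbm2", BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=20, random_state=0)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    model.fit(X_train, y_train)
    print("test accuracy:", model.score(X_test, y_test))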
Taking the question "Is there any algorithm to discover the BEST set of features from raw data and class information?" literally, the answer is NO. Any algorithm for feature extraction, supervised or unsupervised, is extracting features under some set of assumptions, which may or may not be optimal for your problem. More to the point, if you have deep knowledge of a particular domain, you will often do better at defining features manually than any algorithm can do automatically.
A good way of discovering good features is to fit the model using a subset of the features and then calculate the error, or measure how well the model fits. You can then try adding one feature at a time and see whether it decreases the error significantly; if it does, it is a good discriminative feature. This is a type of greedy matching pursuit for classification/regression with L1 regularization (if you have a fixed threshold for adding a given feature, you can see that this algorithm is doing coordinate ascent on feature space with L1 regularization, as in http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.149.5344&rep=rep1&type=pdf ). It can be good for feature selection, but not so much for feature discovery (a sketch of this greedy procedure follows below).
(Jun 30 '10 at 19:49)
Alexandre Passos ♦
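A minimal sketch of the greedy forward selection described in the comment above, assuming scikit-learn and its digits dataset (the 0.005 improvement threshold is an arbitrary choice for illustration):

    # Greedy forward feature selection: keep adding the single feature that
    # most improves cross-validated accuracy until no feature helps enough.
    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_digits(return_X_y=True)
    selected, remaining = [], list(range(X.shape[1]))
    best_score, threshold = 0.0, 0.005

    while remaining:
        # score each candidate feature added to the currently selected subset
        scores = [
            (np.mean(cross_val_score(LogisticRegression(max_iter=1000),
                                     X[:, selected + [j]], y, cv=3)), j)
            for j in remaining
        ]
        score, j = max(scores)
        if score - best_score < threshold:
            break  # no remaining feature improves the fit enough; stop
        selected.append(j)
        remaining.remove(j)
        best_score = score

    print("selected features:", selected, "cv accuracy:", round(best_score, 3))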
Agreed. I am still looking for an answer to the question posted.
(Jul 01 '10 at 01:06)
Aman
Some form of Principal Component Analysis can work well for selecting a small set of features from a larger set in a supervised learning environment. For example, if you have 100 potential variables and you only want to use 40 of them, you can use some form of PCA to determine which of the variables have the most impact on your labeled results. There is a good PCA package for numpy: http://folk.uio.no/henninri/pca_module/ I recommend using the SVD method.
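As an illustration, a minimal PCA-via-SVD sketch in plain numpy; the data here is random noise, and keeping 40 components just mirrors the numbers in the answer above:

    # PCA via the singular value decomposition: project the centered data
    # onto the k directions of largest variance.
    import numpy as np

    def pca_svd(X, k):
        """Return the projection onto the first k principal components."""
        Xc = X - X.mean(axis=0)               # center each variable
        U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
        return Xc @ Vt[:k].T, Vt[:k]          # scores and component directions

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 100))           # e.g. 100 potential variables
    Z, components = pca_svd(X, k=40)          # keep 40 directions
    print(Z.shape)                            # (200, 40)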
If you know that there will be high multicollinearity in your feature set (e.g. overlapping blocks of HOGs), partial least squares is probably a better option. Specifically, whereas PCA solves max_{|r|=1} var(Xr), partial least squares, given a set of features X and labels Y, solves max_{|r|=|s|=1} [cov(Xr, Ys)]^2; i.e. it finds directions that explain variance with respect to both X and Y, instead of just X. It was put to good use in this state-of-the-art human detection paper: http://www.umiacs.umd.edu/~lsd/papers/PLS-ICCV09.pdf. The paper's detection error tradeoff is the best so far (a sketch of PLS as a feature extractor follows below).
(Jun 30 '10 at 19:28)
Tudor Achim
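A minimal sketch of PLS as a supervised feature extractor, assuming scikit-learn; the labels are one-hot encoded so that PLS can exploit the class information, and keeping 20 latent directions is an illustrative choice:

    # Partial least squares: find directions that explain covariance between
    # the features X and the (one-hot encoded) labels Y, then project onto them.
    import numpy as np
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.datasets import load_digits

    X, y = load_digits(return_X_y=True)
    Y = np.eye(10)[y]                     # one-hot encode the 10 digit classes

    pls = PLSRegression(n_components=20)  # number of latent directions to keep
    pls.fit(X, Y)
    Z = pls.transform(X)                  # label-aware low-dimensional features
    print(Z.shape)                        # (1797, 20)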
PCA only does a change of basis. From the moment you start discarding some of the principal axes (the ones with the smaller eigenvalues), you have to know what you are doing, because they may actually be more discriminative than the ones with the larger eigenvalues (the ones that capture most of the variance in the data). In other words, PCA can reduce the dimensionality of your data, but not necessarily in a way that helps perform a discriminative task afterwards.
(Jul 06 '10 at 05:54)
Hugo Penedones
Already upvoted Tudor, but I just want to underline that PLS is a really powerful method. Check out Locally weighted projection regression for example. Also, with regard to Hugo's answer, you may want to look at minor component analysis and (bayesian) eXtreme component analysis from Max Welling.
(Jul 06 '10 at 12:47)
osdf
Non-negative matrix factorization is supposed to do just about that. You have to specify the number of features you're looking for beforehand, though. "Programming Collective Intelligence" by Toby Segaran has a pretty good chapter on it (I can't find much on the web; it's a relatively new technique).
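For what it's worth, a minimal sketch of NMF as feature discovery, assuming scikit-learn and its digits dataset (16 components is an arbitrary illustrative choice):

    # Non-negative matrix factorization: factor the non-negative pixel matrix
    # into parts-like basis images (H) and per-image activations (W).
    from sklearn.datasets import load_digits
    from sklearn.decomposition import NMF

    X, _ = load_digits(return_X_y=True)   # pixel intensities are non-negative
    nmf = NMF(n_components=16, init="nndsvda", max_iter=500, random_state=0)
    W = nmf.fit_transform(X)              # activations: the discovered features
    H = nmf.components_                   # 16 basis images of 64 pixels each
    print(W.shape, H.shape)               # (1797, 16) (16, 64)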
The entropy of a feature might be a good way to assess its relevance to your classification task.
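One way to make this concrete is to score each feature by its mutual information with the class labels (the label entropy minus the conditional label entropy given the feature) and keep the highest-scoring ones. A minimal sketch, assuming scikit-learn and its digits dataset:

    # Rank features by estimated mutual information with the class labels.
    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.feature_selection import mutual_info_classif

    X, y = load_digits(return_X_y=True)
    mi = mutual_info_classif(X, y, random_state=0)   # one score per feature
    top = np.argsort(mi)[::-1][:10]                  # ten most informative pixels
    print("most informative feature indices:", top)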
There is a great recent graduate of Andrew Ng's, Honglak Lee, whose dissertation is on exactly this topic, and I believe he has had great success tackling this problem in vision and speech. I highly recommend everyone check out his work. In particular, "Unsupervised feature learning for audio classification using convolutional deep belief networks" is recommended.
What features did you manually define?