|
One of the benefits of deep learning is that we do not need to hand-engineer features for a specific domain. However, for the areas where deep learning has been successful, I am wondering whether any applied research has compared deep learning algorithms when they are fed (i) the raw data only, (ii) the raw data together with the field's best hand-engineered features, or (iii) only the field's best hand-engineered features. So far I have only seen (i) done or discussed (e.g., feeding raw pixels to the network for a computer vision task).
|
I know for speech and audio the "raw data" is rarely used for best results. Instead they use a Fourier or wavelet transform and then use convnets or RNNs on the image-like output of the transform. I have seen a paper where they used the raw audio data and a 1D conv layer to directly learn features, but I think they were not able to equal the results (though maybe came close, if I recall) vs. learning on the transformed input.
Nice one. What about for computer vision?
(Feb 08 '14 at 08:12)
Jeremiah M
Dan, I was wondering if you have a reference for this paper, or if you have any idea how I could go about finding it? I've done some very similar work recently for music audio signals, so I'd be interested to compare results :) Thanks!
With regards to computer vision: a lot of the time, no preprocessing whatsoever is done. Convolutional neural nets work pretty well with raw pixel data nowadays. Sometimes there is some preprocessing, like contrast normalisation or whitening, but definitely nothing that could qualify as 'feature extraction'. In that light, I think it's interesting that for audio, we still need some kind of mid-level representation to get the best performance with deep learning techniques (that was my conclusion for music signals as well). Then again, one could argue that the brain does not process raw audio signals either: the inner ear already does something that amounts to a frequency decomposition.
(Feb 08 '14 at 19:07)
Sander Dieleman
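For anyone wondering what the "image-like output of the transform" discussed above looks like in practice, here is a minimal sketch of a magnitude spectrogram built with a short-time Fourier transform in NumPy. The function name `magnitude_spectrogram`, the frame length, the hop size, and the synthetic 1 kHz test tone are all assumptions made for illustration; they are not taken from any of the papers mentioned in this thread.

```python
import numpy as np

def magnitude_spectrogram(signal, frame_len=256, hop=128):
    """Return STFT magnitudes with shape (n_frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # One real-input FFT per frame: rows are time steps, columns are frequency bins.
    return np.abs(np.fft.rfft(frames, axis=1))

# Hypothetical input: one second of a 1 kHz sine sampled at 8 kHz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)

spec = magnitude_spectrogram(tone)   # 2D time-frequency array
log_spec = np.log1p(spec)            # log compression, a common extra step
```

The resulting 2D array (time by frequency) is the kind of input a convnet could treat like an image, whereas the raw-audio alternative would pass `tone` itself to a 1D convolutional layer.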
|
|
There was a paper on multimodal data by Hinton's group where hand-engineered features for images were used: www.cs.toronto.edu/~rsalakhu/papers/Multimodal_DBM.pdf
What was the result?
(Feb 08 '14 at 10:17)
Jeremiah M
|