One of the benefits of deep learning is that we do not need to hand-engineer features for a specific domain. However, in the areas where deep learning has been successful, has there been any applied research comparing deep learning algorithms when they are fed (i) only the raw data, (ii) the raw data together with the field's best hand-engineered features, or (iii) only the field's best hand-engineered features?

So far I have only seen (i) done or discussed (e.g., feeding a model the raw pixels for a computer vision task).

asked Feb 03 '14 at 22:57

Jeremiah M


2 Answers:

I know that for speech and audio, the raw waveform is rarely used for best results. Instead, a Fourier or wavelet transform is applied first, and convnets or RNNs are run on the image-like output of the transform. I have seen a paper that used the raw audio data with a 1D convolutional layer to learn features directly, but as I recall it was not able to equal the results obtained on the transformed input (though it may have come close).
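Not from the original post, but here is a minimal sketch of the kind of transform described above: a short-time Fourier transform in NumPy that turns a 1D waveform into an image-like (time x frequency) array. The frame length, hop size, and Hann window are arbitrary illustrative choices, not taken from any of the papers mentioned.

```python
import numpy as np

def stft_magnitude(signal, frame_len=256, hop=128):
    """Slice the signal into overlapping frames, apply a Hann window,
    and take the magnitude of the real FFT of each frame. The result
    is a 2D (frames x frequency bins) array that a convnet or RNN
    can consume like an image."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

# Toy example: one second of a 440 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
spec = stft_magnitude(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (n_frames, frame_len // 2 + 1)
```

The "image" here would then be fed to a 2D convolutional net, in contrast to running a 1D conv layer directly on the waveform.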

answered Feb 06 '14 at 21:55

Dan Ryan

Nice one. What about for computer vision?

(Feb 08 '14 at 08:12) Jeremiah M

Dan, I was wondering if you have a reference for this paper, or if you have any idea how I could go about finding it? I've done some very similar work recently for music audio signals, so I'd be interested to compare results :) Thanks!

With regard to computer vision: often no preprocessing whatsoever is done; convolutional neural nets work quite well on raw pixel data nowadays. Sometimes there is some preprocessing, such as contrast normalisation or whitening, but nothing that could qualify as 'feature extraction'.

In that light, I think it's interesting that for audio, we still need some kind of mid-level representation to get the best performance with deep learning techniques (that was my conclusion for music signals as well).

Then again, one could argue that the brain does not process raw audio signals either: the inner ear already performs something that amounts to a frequency decomposition.
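To make the contrast normalisation mentioned above concrete, here is a minimal per-image sketch in NumPy (the function name and the epsilon are my own illustrative choices, not a reference implementation):

```python
import numpy as np

def global_contrast_normalize(images, eps=1e-8):
    """Per-image contrast normalisation: subtract each image's own
    mean and divide by its own standard deviation, so every image
    reaches the network with zero mean and unit variance."""
    flat = images.reshape(len(images), -1).astype(np.float64)
    flat = flat - flat.mean(axis=1, keepdims=True)
    flat = flat / (flat.std(axis=1, keepdims=True) + eps)
    return flat.reshape(images.shape)

# Toy batch of four 32x32 "images" with raw pixel values in [0, 255].
batch = np.random.RandomState(0).uniform(0, 255, size=(4, 32, 32))
normed = global_contrast_normalize(batch)
```

Note that this only rescales intensities; unlike a hand-engineered feature extractor, it throws away no spatial structure.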

(Feb 08 '14 at 19:07) Sander Dieleman

There was a paper on multimodal data from Hinton's group in which hand-engineered features for images were used: www.cs.toronto.edu/~rsalakhu/papers/Multimodal_DBM.pdf

answered Feb 08 '14 at 09:32

Sharath Chandra

What was the result?

(Feb 08 '14 at 10:17) Jeremiah M

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.