I'm implementing Lee et al.'s convolutional DBN as a feature-extraction method for bird-song classification, and I've run into some problems. As I understand it, they treat the different frequency channels of the spectrogram (after PCA, though) independently to build a two-layer deep belief network. I'm confused about some details of their implementation:

  1. (Section 3.1) For each utterance they have 300 max-pooled first-layer bases; how are the 300 second-layer bases then trained? As I understand it, there would be 90,000 max-pooled second-layer outputs for each utterance, which is apparently wrong.
  2. (Section 4.1) They say, "We obtained spectrogram/MFCC/CDBN representations for each utterance with multiple (typically, several hundred) frames. We used simple summary statistics (for each channel) such as average, max, or standard deviation over all the frames." But for the CDBN there are hundreds of max-pooled hidden layers (each with multiple frames) per utterance, so what are the "multiple frames" of the CDBN representations here?
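To illustrate point 2, here is a rough numpy sketch (not the paper's actual code) of how per-channel summary statistics collapse a variable-length activation sequence into a fixed-length feature vector; the array shape and random data are assumptions for illustration:

```python
import numpy as np

# Hypothetical pooled CDBN activations for one utterance:
# shape (num_frames, num_channels); the frame count varies per utterance.
rng = np.random.default_rng(0)
activations = rng.random((240, 300))

# Summary statistics over the time (frame) axis, per channel, giving a
# fixed-length feature vector regardless of the utterance's length.
features = np.concatenate([
    activations.mean(axis=0),
    activations.max(axis=0),
    activations.std(axis=0),
])
print(features.shape)  # (900,) = 3 statistics x 300 channels
```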

Thanks for your help!

Update:

Thanks @Sharath Chandra. I've checked their source code; it turns out they combine the hidden layers during each training iteration.

asked Mar 22 '14 at 02:43


Jingwei Zhang

edited Apr 17 '14 at 17:57


One Answer:
  1. The number of max-pooled first-layer bases is independent of the number of second-layer bases. It works like this: you can have x inputs and y hidden units in the first layer. In the second layer, y becomes the number of inputs, and you can choose z, which will be the number of hidden units.

  2. The multiple frames are the inputs; the representations are the features extracted from them.
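Point 1 above can be sketched in numpy: each layer correlates its input with a bank of filters and max-pools the result, and the number of filters in one layer is chosen independently of the previous layer's. This is a shape-only illustration, not the paper's implementation; the helper name and the filter/pool sizes are assumptions:

```python
import numpy as np

def conv_pool_1d(x, filters, pool):
    """'Valid' 1-D correlation of x with each filter, then non-overlapping
    max pooling with ratio `pool`. x: (length,), filters: (k, w)."""
    k, w = filters.shape
    out_len = x.shape[0] - w + 1
    out = np.stack([np.convolve(x, f[::-1], mode='valid') for f in filters])
    out = out[:, : (out_len // pool) * pool]        # drop the ragged tail
    return out.reshape(k, -1, pool).max(axis=2)     # (k, out_len // pool)

rng = np.random.default_rng(0)
x = rng.random(17)                                  # toy 1-D input
layer1 = conv_pool_1d(x, rng.random((300, 6)), pool=3)
print(layer1.shape)  # (300, 4): 300 bases, pooled length (17 - 6 + 1) // 3 = 4
```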

answered Mar 22 '14 at 07:50


Sharath Chandra

Thanks a lot! Could you please tell me if the following analysis goes wrong: assume the 1-D input has length 17. Per their description, the layers have filters of length 6, a max-pooling ratio of 3, and 300 hidden bases, so the first-layer output has length (17 - 6 + 1)/3 x 300 = 1200, and thus the second-layer output has length (1200 - 6 + 1)/3 x 300, which is too large.
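The arithmetic in this comment can be written out as a small script; the `pooled_len` helper is hypothetical, and the numbers simply follow the lengths stated above:

```python
# Shape arithmetic only (no data): length after a 'valid' convolution
# with filter width w, then integer max-pooling with ratio `pool`.
def pooled_len(n, w, pool):
    return (n - w + 1) // pool

L1 = pooled_len(17, 6, 3)             # 4 pooled frames per first-layer basis
layer1_total = L1 * 300               # 1200 values if all 300 bases are concatenated
layer2_total = pooled_len(layer1_total, 6, 3) * 300
print(L1, layer1_total, layer2_total)  # 4 1200 119400
```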

(Mar 22 '14 at 17:46) Jingwei Zhang

Yes, the analysis sounds right to me. That's the size of the input for the third layer. You can run it in mini-batches so that it fits in your RAM.

Have a look at Honglak's code on his website (the link is also in other threads on this site).

(Mar 23 '14 at 05:43) Sharath Chandra