At the moment I am studying convolutional deep belief networks (CDBNs) for audio classification tasks. I have read some papers about them, but I still have a lot of questions. My question is: how are CDBNs applied to audio data?

From the paper: "For the application of CDBNs to audio data, we first convert time-domain signals into spectrograms. However, the dimensionality of the spectrograms is large (e.g., 160 channels). We apply PCA whitening to the spectrograms and create lower dimensional representations. Thus, the data we feed into the CDBN consists of nc channels of one-dimensional vectors of length nV, where nc is the number of PCA components in our representation. Similarly, the first-layer bases are comprised of nc channels of one-dimensional filters of length nW."
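To make sure I am reading this correctly, here is a minimal sketch of the pipeline as I understand it (signal -> spectrogram -> PCA whitening -> nc channels of length-nV vectors), in Python with scipy and scikit-learn. The window/hop sizes and the number of components are my assumptions for illustration, not values confirmed by the paper:

    # Sketch of the preprocessing as I understand it; window/hop sizes and
    # the number of PCA components are assumptions, not taken from the paper.
    import numpy as np
    from scipy.signal import spectrogram
    from sklearn.decomposition import PCA

    def whitened_spectrogram(signal, fs, n_components=80):
        nperseg = int(0.02 * fs)               # ~20 ms window (assumed)
        noverlap = nperseg // 2                # ~10 ms overlap (assumed)
        _, _, S = spectrogram(signal, fs=fs, nperseg=nperseg, noverlap=noverlap)
        frames = np.log(S.T + 1e-8)            # shape (nV, n_freq): one row per time frame

        # PCA whitening across the frequency axis: each frame is projected
        # onto n_components principal components, rescaled to unit variance.
        pca = PCA(n_components=n_components, whiten=True)
        whitened = pca.fit_transform(frames)   # shape (nV, nc)

        # The CDBN input would then be nc channels of 1-D vectors of length nV.
        return whitened.T                      # shape (nc, nV)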

What do they mean by the dimensionality of the spectrograms? And what is the total number of input units for the CDBN?

Edit: Is it true that the whole spectrogram (after PCA whitening) is used as input at once? But what do they mean then by a filter length (nW) of 6?

Edit 2: OK, a more specific question. Table 1 reports the test classification accuracy for speaker identification using summary statistics. Take MFCC, for example. Features are extracted for each frame (20 ms, with 10 ms overlap), so for l frames and n features per frame you have l * n features in total. Summary statistics are then computed, giving an n-dimensional vector as input to the supervised classifier (SVM). But how is this done for the CDBN? Here each base has a filter length of 6 (so a filter looks at 6 frames together and convolves along the time axis?). Do you then take summary statistics over fewer feature vectors than in the MFCC case? See the sketch below for what I mean by the summary-statistics step.
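For concreteness, this is the kind of summary-statistics step I have in mind; the particular statistics (mean, standard deviation, max) are just an assumption for illustration:

    import numpy as np

    def summarize(frame_features):
        # frame_features: shape (l, n) -- l frame-level feature vectors
        # (MFCC frames, or CDBN hidden-unit activations over time).
        # The result is a fixed-length vector regardless of l.
        return np.concatenate([
            frame_features.mean(axis=0),
            frame_features.std(axis=0),
            frame_features.max(axis=0),
        ])

If the CDBN convolves a length-6 filter over l frames ("valid" convolution), I would expect l - 6 + 1 activation vectors per base, i.e., slightly fewer rows going into the summary than the l MFCC frames. Is that the right way to think about it?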

asked Oct 24 '11 at 14:18


Yu Chong

edited Oct 25 '11 at 15:52

Maybe it would be good if you gave a reference for the paper.

(Oct 24 '11 at 14:20) Andreas Mueller

Here it is: http://www.eecs.umich.edu/~honglak/nips09-AudioConvolutionalDBN.pdf. Unsupervised feature learning for audio classification using convolutional deep belief networks. Honglak Lee, Yan Largman, Peter Pham, and Andrew Y. Ng. Advances in Neural Information Processing Systems (NIPS) 22, 2010.

(Oct 24 '11 at 14:22) Yu Chong

My experience in the past re: Honglak Lee and his paper on CDBNs for images: Honglak is quite responsive to questions. You might consider dropping him an email -- preferably with a more specific question.

(Oct 25 '11 at 11:24) Brian Vandenberg

I am working on the Honglak Lee paper (mentioned in this thread). My question is: after computing the spectrogram I get a 2-dimensional matrix, and after PCA whitening I retain (say) 100 components, so I have a matrix of size N-by-100, where N is the number of time bins.
Should I apply the PCA to each audio file separately and then concatenate all the resulting matrices (i.e., 10 matrices for 10 files) to feed them to a DBN or an RBM? For each audio file the size of the matrix after PCA is not the same, i.e., N is not the same. Does this affect the RBM performance? See the sketch below for the bookkeeping I mean.
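To be explicit, here is a sketch that assumes a single PCA fitted on frames pooled from all files; that is just one possible arrangement, not something the paper prescribes, and it is exactly the per-file vs. shared question I am asking about:

    import numpy as np
    from sklearn.decomposition import PCA

    # spectrograms: list of per-file arrays, each of shape (N_i, 160),
    # where N_i (the number of time bins) differs from file to file.
    def whiten_all(spectrograms, n_components=100):
        pooled = np.vstack(spectrograms)       # shape (sum_i N_i, 160)
        pca = PCA(n_components=n_components, whiten=True).fit(pooled)
        # Each file keeps its own length N_i but now has 100 columns.
        return [pca.transform(S) for S in spectrograms]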

(Dec 02 '13 at 09:27) chinaali

One Answer:

By "dimensionality", they simply mean the number of features (e.g. the dimensionality of an 8x8 image patch would be 64). In this case, the powers at different frequencies are treated as features. It sounds like they apply Principal Component Analysis to each fft window to reduce magnitudes of 160 evenly-spaced frequencies to 80 features. Section 2.2 of the paper gives a general overview of their approach, and section 3.1 gives the specifics.

answered Oct 24 '11 at 18:45


John Vinyard

edited Oct 24 '11 at 18:50

Thanks for your answer, it is becoming clearer now. But what do they mean by one-dimensional vectors of length nV?

From section 2.1: "we assume that all inputs to the algorithm are single-channel time-series data with nV frames (an nV dimensional vector)".

Does this mean that the total number of frames is nV, with each frame being, for example, 20 ms with 10 ms overlap?

(Oct 25 '11 at 04:19) Yu Chong

