I'm trying to do audio feature extraction with Deep Belief Networks on the TI Digits dataset. The goal is to compare these extracted features with standard MFCC features. For classification LibSVM is used. The method with MFCC and SVM gives over 90% performance and works well. However, the RBM/DBN does not work very well: almost every digit is classified as a two. I use FFT features as input for an RBM at the moment. The RBM used is from the matrbm library (http://code.google.com/p/matrbm/). The RBM is trained on the complete training set of TI Digits. Classification is then done as follows: FFT features are used as input for the RBM, and the hidden node values are used to train an SVM. However, this approach seems too simple / does not work. Does anybody have tips to improve this?

Edit (more details): Variable-length digits are indeed a problem. What I'm doing right now, though it is not optimal, is capping everything after a certain number of frames and padding with zeros if the utterance is shorter. To convert the FFT features to input for the RBM I scale them to [0, 1]. For the number of hidden units in the RBM I used values ranging from 100 to 1000. For the MFCC features I also use log energy, delta, and delta-delta features, but this is for the baseline method to compare with. This works well; it gives almost 97.9% performance at the moment.

Edit 2 (found possible problem): I use the matlab voicebox toolbox (http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html) and extract FFT features for each file as follows:
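(The snippet itself is not shown here; a minimal sketch of what is described, reconstructed from the discussion below, would be roughly the following. The window length, overlap, and variable names are assumptions, not the original code.)

    % Minimal sketch of the described extraction (parameters assumed)
    [s, fs] = readwav(file);                 % voicebox wav reader
    frames  = enframe(s, 256, 128);          % 256-sample frames, 50% overlap
    F       = rfft(frames, 256, 2);          % FFT of each real-valued frame
    feat    = [mean(F, 1), var(F, 0, 1)];    % per-coefficient mean and variance
    % note that F is still complex here (magnitude and phase)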
When using this as input to an SVM it gives 81.55% performance. I then scaled the features to [0, 1], because that is necessary for the RBM:
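(Again a reconstructed sketch, since the original scaling code is not shown; a per-dimension min-max normalization of the feature matrix X is assumed:)

    % Assumed scaling step: per-dimension min-max normalization to [0, 1]
    lo = min(X, [], 1);
    hi = max(X, [], 1);
    Xs = bsxfun(@rdivide, bsxfun(@minus, X, lo), max(hi - lo, eps));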
But when using this as input for an SVM, performance drops to about 20% and almost everything is classified as a 1 or a 2. Does anyone have ideas on how to feed an audio file to an RBM?
Here is what I can think of with a quick glance at what you are doing.
Be as sure as possible that the input features given to the RBM are good before trying to train an RBM, and also make sure you are training an appropriate RBM. At this point, you have one setup that happens to contain an RBM and doesn't work well, and another setup that does work well but differs in many more ways than just not using an RBM. Simplify your comparisons until you understand what is happening.

Thanks for your valuable comment. I'm now throwing away the phase information. The goal of the project is to compare learned features with hand-designed ones (the MFCCs in the baseline method). Do you know how Honglak Lee et al. in "Unsupervised feature learning for audio classification using convolutional deep belief networks" presented a spectrogram to an RBM? Did they use a Gaussian RBM?
(Jan 10 '12 at 11:06)
Yu Chong
Yes, Honglak Lee used Gaussian visible units when dealing with real-valued input. Take a look at equation 6 in his paper: http://www.eecs.umich.edu/~honglak/nips09-AudioConvolutionalDBN.pdf
(Jan 14 '12 at 20:06)
gdahl ♦
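For reference, the energy function of an RBM with Gaussian visible units and binary hidden units (the convolutional form in the paper's equation 6 adds weight sharing on top of this) is commonly written as:

    E(v, h) = \sum_i \frac{(v_i - b_i)^2}{2\sigma_i^2} - \sum_j c_j h_j - \sum_{i,j} \frac{v_i}{\sigma_i} W_{ij} h_j

where sigma_i is often fixed to 1 once the data have been standardized to zero mean and unit variance.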
Things I would check:
I used the RBM implementation made by Andrej Karpathy, available here: http://code.google.com/p/matrbm/. It works well on the supplied examplecode.m on MNIST. On my data, however, the reconstruction error goes down only during the first epoch; after that it remains approximately constant.
(Jan 05 '12 at 06:55)
Yu Chong
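(For a binary-unit RBM, such a reconstruction-error check can be computed as below; W, b, c are assumed names for the learned weights and visible/hidden biases, and this is illustrative rather than matrbm's internal code:)

    % Illustrative reconstruction-error check for a Bernoulli RBM
    H    = 1 ./ (1 + exp(-bsxfun(@plus, X * W, c)));   % hidden probabilities
    Xrec = 1 ./ (1 + exp(-bsxfun(@plus, H * W', b)));  % visible reconstruction
    err  = mean(sum((X - Xrec).^2, 2));                % mean squared error per case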
Have you tried giving the variables zero mean and unit variance and using Gaussian visible units for the RBM? You could also try simply binning the values. In my experience, [0, 1] variables work very poorly for RBMs unless the values have a loosely probabilistic interpretation; either of the former options should work much better.

Regarding your second edit: looking at the performance of the features using just the SVM as a sanity check is a sensible thing to do, and the roughly 60-point drop after normalizing is indicative of a problem. Unfortunately, I've never done much with SVMs, so I can't speak to what would cause such a drastic drop in performance.
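(A sketch of the suggested standardization; Xtrain/Xtest are placeholder names, and the statistics come from the training set only:)

    % Zero mean, unit variance per feature, for Gaussian visible units
    mu = mean(Xtrain, 1);
    sd = std(Xtrain, 0, 1);
    sd(sd == 0) = 1;                               % guard against constant features
    Ztrain = bsxfun(@rdivide, bsxfun(@minus, Xtrain, mu), sd);
    Ztest  = bsxfun(@rdivide, bsxfun(@minus, Xtest,  mu), sd);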
I am not sure you will get interesting results with a single RBM. AFAIK, unsupervised feature extraction usually works with 2- or 3-layer DBNs. As for your debugging: did you do a grid search for both C and gamma on your SVM classifier? What do the extracted RBM features look like? Have you checked that each unit (i.e., each dimension or extracted feature) is ...

The plan is indeed to add more layers, but I'm not sure that will work if the first layer is not able to extract meaningful features at the moment. So my plan is to first fix the problems with the first layer. Also, thanks for your other tips; I will have a look at the things you mentioned.
(Jan 05 '12 at 06:50)
Yu Chong
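(For reference, the grid search suggested above could look like this with the libsvm MATLAB interface; svmtrain with '-v' returns cross-validation accuracy, and the variable names are placeholders:)

    % Coarse grid search over C and gamma with 5-fold cross-validation
    best = 0;
    for log2c = -5:2:15
        for log2g = -15:2:3
            opts = sprintf('-c %g -g %g -v 5', 2^log2c, 2^log2g);
            acc  = svmtrain(trainLabels, trainFeatures, opts);
            if acc > best, best = acc; bestc = 2^log2c; bestg = 2^log2g; end
        end
    end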
I'm seeing two issues with this setup. 1) rfft computes the FFT of a real signal, so its output is a collection of complex numbers that represent magnitude and phase. For speech recognition, we usually discard the phase information and work with a log-power spectrum; the log is needed to reduce the dynamic range of the features. 2) If I'm understanding the matlab notation correctly, your features are the mean and variance of each FFT coefficient, computed over the entire utterance. This discards a lot of information about the identity of the utterance. You'd have more luck dividing each utterance into a fixed number of segments and computing statistics over those segments, so that you keep more temporal information. Even then, this will only work for a small-vocabulary task like digits.

You are right. I'm now planning to try it with 300 learned basis functions. I use a sliding window of 6 consecutive frames that slides through each utterance; each window of 6 frames is used as input for an RBM. Then each utterance can be described by a linear combination of basis functions and offsets.
(Jan 10 '12 at 11:00)
Yu Chong
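(A sketch of the fixed-number-of-segments idea from the answer above; F is the per-frame log-power spectrum of one utterance, nFrames x nCoeffs, and K = 10 is an arbitrary choice that assumes nFrames >= K:)

    % Split the utterance into K segments and keep per-segment means,
    % giving a fixed-length vector regardless of utterance length.
    K = 10;
    edges = round(linspace(0, size(F, 1), K + 1));
    seg = zeros(K, size(F, 2));
    for k = 1:K
        seg(k, :) = mean(F(edges(k)+1:edges(k+1), :), 1);
    end
    feat = seg(:)';                                % fixed-length SVM input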
Can you provide a bit more detail on your setup? How are you converting the variable-length sequences of features into a fixed-length input for the SVM? Are you using only MFCC features, or MFCCs together with delta (temporal difference) features?
A quick question: rather than FFTs, have you tried presenting spectrograms to the DBN?
@amair What exactly do you mean by the difference between FFTs and spectrograms?
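(For what it's worth, a spectrogram in this context is just the per-frame log-power FFT arranged as a time-frequency matrix rather than a single vector of statistics; in voicebox terms, roughly:)

    % Log-power spectrogram: one row per frame, one column per frequency bin
    S = log(abs(rfft(enframe(s, 256, 128), 256, 2)).^2 + eps);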
You might want to plot the FFTs going into the RBM to make sure that the data is separable. Additionally, it's worth taking a look at the output of the RBM to see whether the SVM can deal with the extracted features. It might be that the RBM is working but the features still aren't useful to the SVM, and may require additional RBM layers.
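(One way to inspect the extracted features; W and c here are assumed names for the learned weights and hidden biases:)

    % Hidden-unit activations that are fed to the SVM
    H = 1 ./ (1 + exp(-bsxfun(@plus, X * W, c)));   % N examples x nHidden
    imagesc(H); colorbar;                           % do different digits look different?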
@yu chong the paper I was thinking of that used spectrograms was [Unsupervised feature learning for audio classification using convolutional deep belief networks][1].
[1] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.154.380&rep=rep1&type=pdf
That is an interesting paper; I used it as inspiration for this project. However, I have read it multiple times and I still do not fully understand how they feed the spectrogram to a convolutional DBN. Do you have any ideas? Thanks.