I'm trying to do audio feature extraction with Deep Belief Networks (DBNs) on the TI Digits dataset. The goal is to compare the extracted features with standard MFCC features. For classification, LibSVM is used.

The MFCC + SVM method gives over 90% accuracy and works well. However, the RBM/DBN does not work very well: almost every digit is classified as a two.

At the moment I use FFT features as input for an RBM. The RBM is from the matrbm library (http://code.google.com/p/matrbm/) and is trained on the complete training set of TI Digits. Classification is then done as follows: the FFT features are fed into the RBM, and the hidden node activations are used to train an SVM. However, this approach seems too simple and does not work. Does anybody have tips for improving it?

Edit (more details): Variable-length digits are indeed a problem. What I'm doing right now, which is not optimal, is truncating everything after a certain number of frames and zero-padding utterances that are shorter.
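
Roughly, the capping looks like this (maxFrames is a placeholder value, and the column-per-frame layout of F is an assumption):

    % Truncate to maxFrames frames; zero-pad utterances that are shorter.
    % F: spectral features, one frame per column (assumed layout).
    maxFrames = 50;
    [nbins, nframes] = size(F);
    if nframes >= maxFrames
        Ffixed = F(:, 1:maxFrames);
    else
        Ffixed = [F, zeros(nbins, maxFrames - nframes)];
    end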

To convert the FFT features to RBM input, I scale them to the [0,1] range. For the number of hidden units in the RBM I tried values ranging from 100 to 1000.

For the MFCC features I also use log energy, delta, and delta-delta features, but that is for the baseline method to compare against. It currently gives about 97.9% accuracy.

Edit 2 (possible problem found): I use the MATLAB voicebox toolbox (http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html) and extract FFT features for each file as follows:

    [y, fs] = readwav(filename);                       % waveform and sample rate
    frames = enframe(y, window, length(window) / 2)';  % frames as columns, 50% overlap
    F = rfft(frames);                                  % complex spectrum of each frame
    f = [mean(F') std(F')];                            % per-bin mean and std over all frames
    trainingSet(i, :) = f;                             % one fixed-length row per utterance

When using this as input to an SVM it gives 81.55% accuracy. I then scaled the features to the [0,1] range, because that is necessary for the RBM:

    mu = mean(trainingSet);
    range = max(trainingSet) - min(trainingSet);

    % Scale features to (roughly) [-1, 1]
    trainingSet = (trainingSet - repmat(mu, size(trainingSet, 1), 1)) ./ repmat(range, size(trainingSet, 1), 1);
    % Shift to the [0, 1] scale
    trainingSet = (trainingSet + 1) / 2;

But when using this as input for an SVM, performance drops to about 20% and almost everything is classified as a 1 or a 2. Any ideas on how to feed an audio file to an RBM?

asked Jan 04 '12 at 05:55 Yu Chong
edited Jan 05 '12 at 16:03

Can you provide a bit more detail on your setup? How are you converting variable-length sequences of features to a fixed-length input to the SVM? Are you using only MFCC features, or MFCCs with delta (temporal differences) features?

(Jan 04 '12 at 08:17) bedk

A quick question: rather than FFTs, have you tried presenting spectrograms to the DBN?

(Jan 04 '12 at 09:19) amair

@amair What exactly do you mean by the difference between FFTs and spectrograms?

(Jan 05 '12 at 06:56) Yu Chong

You might want to plot the FFTs going into the RBM to make sure that the data is separable. Additionally, it's worth looking at the output of the RBM to see if the SVM can deal with the extracted features. It might be that the RBM is working but the features still aren't useful to the SVM and may require additional RBMs.

(Jan 05 '12 at 11:48) nop

@Yu Chong The paper I was thinking of that used spectrograms was [Unsupervised feature learning for audio classification using convolutional deep belief networks][1].

[1] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.154.380&rep=rep1&type=pdf

(Jan 05 '12 at 16:20) amair

That is an interesting paper; I used it as inspiration for this project. However, I have read it multiple times and I still do not fully understand how they feed the spectrogram to a convolutional DBN. Do you have any ideas? Thanks.

(Jan 06 '12 at 06:24) Yu Chong

6 Answers:

Here is what I can think of with a quick glance at what you are doing.

  • Don't use a binary-binary RBM when you don't have binary data.

  • I am suspicious of the matrbm library. It doesn't look to be of very high quality. For example, from a quick perusal, it doesn't seem like the softmax gets computed in a numerically stable way. Who knows what other defects it has?

  • You should throw away the phase in your input features; do you do that?

  • Reasonable features might be log mel-filterbank outputs: basically compute MFCCs, but skip the final cepstral transform (see the sketch after this list).

  • Before you play around with an RBM, just train a simple logistic regression classifier on your input features, or at least an SVM; then try a single-hidden-layer neural net (a baseline sketch appears at the end of this answer). Use the same method for dealing with the variable input length as you would for the RBM.

  • Another thing to try is to train a Gaussian RBM on the MFCCs.

  • If your baseline method works so well, why aren't you done? Make sure you actually need better performance on this task.
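
A minimal sketch of the log mel-filterbank suggestion, loosely following what voicebox's melcepst does internally; melbankm's exact return values and the frame length here are assumptions:

    % Log mel-filterbank outputs: MFCCs minus the final cepstral transform.
    % Assumes melbankm returns the filterbank matrix plus the FFT bin range
    % it covers, as used inside voicebox's melcepst.
    n = 256;                                     % frame length in samples
    [y, fs] = readwav(filename);
    frames = enframe(y, hamming(n), n/2)';       % frames as columns, 50% overlap
    F = rfft(frames);                            % complex spectrum per frame
    [mbank, a, b] = melbankm(26, n, fs);         % 26 mel filters; a:b = bin range
    P = F(a:b, :) .* conj(F(a:b, :));            % power spectrum; phase discarded
    logmel = log(max(mbank * P, 1e-20));         % log compresses the dynamic range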

Be as sure as possible that the input features given to the RBM are good before trying to train one, and make sure you are training an appropriate kind of RBM. At this point you have one setup that happens to contain an RBM and doesn't work well, and another setup that works well but differs in far more ways than just the absence of an RBM. Simplify your comparisons until you understand what is happening.
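
And for the logistic regression baseline, a minimal sketch assuming the Statistics Toolbox is available; labels, testSet, and testLabels are placeholder names:

    % Multinomial logistic regression baseline on the same fixed-length
    % features. mnrfit/mnrval are Statistics Toolbox functions; depending
    % on the MATLAB version, labels may need to be a nominal/categorical vector.
    B = mnrfit(trainingSet, labels);        % labels: integer class ids, 1..k
    probs = mnrval(B, testSet);             % class probabilities per example
    [~, pred] = max(probs, [], 2);          % pick the most probable class
    acc = mean(pred == testLabels);         % baseline accuracy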

answered Jan 07 '12 at 21:13 gdahl ♦

Thanks for your valuable comment. I'm now throwing away the phase information. The goal of the project is to compare learned features with hand-designed ones (the MFCCs in the baseline method). Do you know how Honglak Lee et al. in "Unsupervised feature learning for audio classification using convolutional deep belief networks" presented a spectrogram to an RBM? Did they use a Gaussian RBM?

(Jan 10 '12 at 11:06) Yu Chong

Yes, Honglak Lee used Gaussian visible units when dealing with real-valued input. Take a look at equation 6 in his paper: http://www.eecs.umich.edu/~honglak/nips09-AudioConvolutionalDBN.pdf

(Jan 14 '12 at 20:06) gdahl ♦

Things I would check:

  1. Make sure your RBM implementation is correct. It's tricky, so I'd recommend visualizing filters on MNIST.
  2. Does the reconstruction error go down? (By this I mean: does the quality of the sampled reconstructions get better?) It is a dangerous measure, though; see the practical guide on training RBMs. A sketch of computing it is below.
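
A minimal sketch of that check for a binary-binary RBM; V, W, bv, and bh are assumed names here, not necessarily matrbm's:

    % One-step reconstruction error for a binary-binary RBM.
    % V: data (examples in rows), W: visible-by-hidden weights,
    % bv/bh: visible/hidden bias row vectors -- all assumed names.
    sigm = @(x) 1 ./ (1 + exp(-x));
    H  = sigm(bsxfun(@plus, V * W, bh));    % hidden unit probabilities
    Vr = sigm(bsxfun(@plus, H * W', bv));   % one-step reconstruction
    err = mean(sum((V - Vr).^2, 2));        % mean squared error per example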

answered Jan 04 '12 at 10:58 Justin Bayer

I used the RBM implementation by Andrej Karpathy, available here: http://code.google.com/p/matrbm/. It does work well on the supplied examplecode.m on MNIST.

The reconstruction error goes down only during the first epoch; after that it stays approximately constant.

(Jan 05 '12 at 06:55) Yu Chong

Have you tried giving the variables zero mean and unit variance and using Gaussian visible units for the RBM? You could also try simply binning the values. In my experience, [0, 1] variables work very poorly for RBMs unless the values have a loosely probabilistic interpretation. Either of the former options should work much better; a standardization sketch is below.
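
A minimal sketch of the zero-mean, unit-variance option, reusing trainingSet from the question:

    % Per-feature standardization, as an alternative to [0,1] scaling.
    mu = mean(trainingSet);
    sd = std(trainingSet);
    sd(sd == 0) = 1;                 % guard against constant features
    trainingSet = bsxfun(@rdivide, bsxfun(@minus, trainingSet, mu), sd);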

Regarding your second edit: checking the performance of the features with just the SVM is a sensible sanity check, and the roughly 60-point drop after normalizing is indicative of a problem. Unfortunately, I've never done much with SVMs, so I can't speak to what would cause such a drastic drop in performance.

answered Jan 06 '12 at 11:46 alto

I am not sure you will get interesting results with a single RBM. AFAIK, unsupervised feature extraction usually works with 2- or 3-layer DBNs.

As for your debugging: did you do a grid search over both C and gamma for your SVM classifier? What do the extracted RBM features look like? Have you checked that each unit (i.e. each dimension or extracted feature) is active roughly 50% of the time, at least on some significant portion of the dataset?

answered Jan 04 '12 at 08:14 ogrisel
edited Jan 05 '12 at 08:13

The plan is indeed to add more layers, but I'm not sure that will work if the first layer is not able to extract meaningful features. So my plan is to first fix the problems with the first layer.

Also, thanks for your other tips; I will have a look at the things you mentioned.

(Jan 05 '12 at 06:50) Yu Chong

Fixing the first layer before adding more is a good idea.

(Jan 07 '12 at 21:07) gdahl ♦

I'm seeing two issues with this setup.

1) rfft computes the FFT of a real signal, so its output is a collection of complex numbers representing magnitude and phase. For speech recognition, we usually discard the phase information and work with a log-power spectrum; the log is needed to reduce the dynamic range of the features.

2) If I'm understanding the MATLAB notation correctly, your features are the mean and standard deviation of each FFT coefficient, computed over the entire utterance. This discards a lot of information about the identity of the utterance. You'd have more luck dividing each utterance into a fixed number of segments and computing statistics over those segments, so you keep more temporal information (see the sketch below). Even then, this will only work for a small-vocabulary task like digits.
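
A minimal sketch of both fixes, reusing F from the question; the number of segments nseg is an arbitrary choice:

    % Log-power spectrum plus per-segment statistics. F: complex spectrum
    % from the question, one column per frame.
    P = log(max(F .* conj(F), 1e-20));            % phase dropped, range compressed
    nseg = 5;
    edges = round(linspace(0, size(P, 2), nseg + 1));
    f = [];
    for s = 1:nseg
        seg = P(:, edges(s)+1:edges(s+1));        % frames in segment s
        f = [f mean(seg, 2)' std(seg, 0, 2)'];    % per-bin mean and std
    end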

answered Jan 07 '12 at 12:16 bedk

You are right. I'm now planning to try it with 300 learned basis functions. I use a sliding window of 6 consecutive frames that moves through each utterance; each window of 6 frames is used as input for an RBM. Each utterance can then be described by a linear combination of basis functions and offsets.
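
Concretely, building the windowed inputs looks roughly like this (a sketch; logmel is a placeholder for whatever per-frame features end up being used):

    % Stack every run of 6 consecutive frames into one RBM input vector.
    % logmel: per-frame features, one column per frame (assumed name).
    win = 6;
    nframes = size(logmel, 2);
    X = zeros(nframes - win + 1, win * size(logmel, 1));
    for t = 1:nframes - win + 1
        X(t, :) = reshape(logmel(:, t:t+win-1), 1, []);
    end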

(Jan 10 '12 at 11:00) Yu Chong

  1. Don't use a binary-binary RBM for raw acoustic features; use Gaussian-binary.

  2. Do mean/variance normalization instead of [0,1] normalization.

answered Jan 14 '12 at 18:23 exppie
