What is the "standard" procedure for dealing with frames at the start and end of a speech file when concatenating input frames for training a neural network?

When you are inputting multiple frames to a neural network and training it to output the probability distribution over HMM state labels for the middlemost frame, are the first few and last few input vectors padded with zeros? If you don't pad and simply concatenate frames from the start to the end of the file, do you encounter problems with some HMM state labels no longer being present in the training data?

asked Apr 18 '13 at 22:06


Ryan Price

I'm a bit confused about using the NN to output the probability distribution over the state labels.

Do you mean the transition probabilities?

(Apr 19 '13 at 17:56) Leon Palafox ♦

To clarify - I'm just referring to using the NN for acoustic modeling. So the NN is trained to output a probability distribution over the possible labels for the middlemost frame of the input vector (a posterior distribution over HMM states, not transition probabilities).

(Apr 21 '13 at 20:27) Ryan Price

One Answer:

I create floor(windowsize/2) replicas of the first frame and of the last frame of the acoustic data. So if I am concatenating 11 frames to serve as input to the neural net, then I would extend the sequence of frames by 5 copies of the first frame at the beginning and 5 copies of the last frame at the end.
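A minimal sketch of that edge-replication scheme, in plain Python so it runs without dependencies. The function name `splice_frames`, the parameter `window_size`, and the toy feature values are assumptions made for illustration, not code from the answer. With NumPy arrays, `np.pad(..., mode="edge")` along the time axis would produce the same padding.

```python
def splice_frames(frames, window_size=11):
    """Return one concatenated context window per original frame.

    frames: list of per-frame feature vectors (lists of floats).
    window_size: odd number of frames per network input (hypothetical name).
    """
    if window_size % 2 != 1:
        raise ValueError("window_size must be odd so there is a middle frame")
    if not frames:
        return []
    half = window_size // 2  # 5 for an 11-frame window
    # Replicate the first and last frame `half` times each.
    padded = [frames[0]] * half + list(frames) + [frames[-1]] * half
    spliced = []
    for t in range(len(frames)):
        window = padded[t:t + window_size]  # centered on original frame t
        spliced.append([x for frame in window for x in frame])
    return spliced


# Toy utterance: 4 frames of 2-dimensional features, 5-frame window.
frames = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]
inputs = splice_frames(frames, window_size=5)
```

Because every original frame gets its own window, the number of inputs equals the number of frames, so the labels of the first and last few frames stay in the training data. Without padding, the first and last floor(windowsize/2) frames would have no complete window and their labels would be dropped.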

answered Apr 19 '13 at 19:46


gdahl ♦

Thank you!

(Apr 21 '13 at 20:18) Ryan Price

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.