|
What is the "standard" procedure for dealing with frames at the start and end of a speech file when concatenating input frames for training a neural network? When you are inputting multiple frames to a neural network and training it to output the probability distribution over HMM state labels for the middlemost frame, are the first few and last few input vectors padded with zeros? If you don't pad and simply concatenate frames from the start to the end of the file, do you run into problems with some HMM state labels no longer being present in the training data? |
|
I create floor(windowsize/2) replicas of the first and last frame of the acoustic data. So if I am concatenating 11 frames to serve as input to the neural net, I extend the sequence of frames by 5 copies of the first frame at the beginning and 5 copies of the last frame at the end. That way every original frame, including the first and last, has a full context window and keeps its label. Thank you!
(Apr 21 '13 at 20:18)
Ryan Price
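The edge-replication padding described in the comment above could be sketched roughly as follows. This is only an illustration, not the commenter's actual code: it assumes NumPy, features stored as a (num_frames, feature_dim) array, and an odd window size; the function name `splice_frames` and the 13-dimensional features are made up for the example.

```python
import numpy as np

def splice_frames(frames, window_size=11):
    """Concatenate each frame with its neighbours, replicating edge frames.

    frames: array of shape (num_frames, feature_dim).
    Returns an array of shape (num_frames, window_size * feature_dim).
    """
    if window_size % 2 != 1:
        raise ValueError("window_size must be odd so there is a middle frame")
    context = window_size // 2  # e.g. 5 frames on each side for a window of 11
    # mode="edge" repeats the first frame `context` times at the start
    # and the last frame `context` times at the end.
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    num_frames = frames.shape[0]
    # Row t holds padded[t : t + window_size], i.e. the window centred
    # on original frame t, flattened into one input vector.
    return np.stack(
        [padded[t:t + window_size].reshape(-1) for t in range(num_frames)]
    )

# Hypothetical usage: 100 frames of 13-dimensional features.
frames = np.random.randn(100, 13)
inputs = splice_frames(frames, window_size=11)
print(inputs.shape)  # (100, 143)
```

Because the output has one row per original frame, the frame-level HMM state labels can be used unchanged, so no labels at the start or end of the file are lost.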
|
I'm a bit confused about using the NN to output the probability distribution over the state labels.
Do you mean the transition probabilities?
To clarify - I'm just referring to using the NN for acoustic modeling. So the NN is trained to output a probability distribution over the possible labels for the middlemost frame of the input vector (a posterior distribution over HMM states, not transition probabilities).
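To make the distinction concrete, here is a rough sketch of what "a distribution over labels for the middle frame" means. Everything in it is an assumption for illustration: a single linear layer with a softmax stands in for the real network, and the number of states (5) and the input size (11 frames of 13 features) are made-up values.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
num_states = 5        # hypothetical number of HMM states
input_dim = 11 * 13   # hypothetical: 11 spliced frames of 13 features each

# Untrained stand-in for the network: one linear layer plus softmax.
W = rng.normal(scale=0.01, size=(input_dim, num_states))
b = np.zeros(num_states)

def state_posteriors(spliced_inputs):
    """Return P(state | window) for the middle frame of each input window."""
    return softmax(spliced_inputs @ W + b)

# Four hypothetical spliced input vectors, one per centre frame.
x = rng.normal(size=(4, input_dim))
p = state_posteriors(x)
print(p.shape)  # (4, 5): one distribution over states per centre frame
```

Each row of `p` sums to 1 and is a posterior over states for one frame. The HMM's transition probabilities are separate parameters and are not produced by this output layer.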