I have a large collection of audio recordings (by different speakers, and with time-varying content and background noise), and would like to estimate the accuracy of automatic speech recognition on them. I plan to choose a sample and have it accurately transcribed, and compare the ASR output with the gold standard transcription.

But how to draw the sample? If accuracy of the estimate were the only concern, one would choose a very large number of very small segments of audio, to best sample the different speakers and conditions. But at some point a segment of audio is too short to transcribe accurately: one would need to listen to speech on either side of the segment anyway. Also, well before that point, the overhead of switching between audio files in the transcription software, and the cognitive overhead of switching among speakers, would argue for longer segments.

I'm curious both about how these choices have typically been made in speech recognition research, and about what statistical tools and assumptions are typically used in computing accuracy estimates.

asked Oct 20 '10 at 17:26

Dave Lewis

Shouldn't it be easier to do the reverse: sample some words/sentences from the ground truth results, get the corresponding audio (assuming things are aligned) and compute test metrics?

(Oct 20 '10 at 20:09) Alexandre Passos ♦

Ah no, the point is that we don't have ground truth. We need to choose which audio to produce ground truth for. And I'm inclined to think one should do the sampling based on the audio, rather than on the text of the ASR output, in case the ASR output is really awful.

(Oct 20 '10 at 21:16) Dave Lewis

4 Answers:

Without more information it is hard to say what the best segment size is. I would definitely not use short segments, for the reason you gave, and because if they are too short you may miss important phenomena that affect SR performance, e.g. word-boundary co-articulation. How many audio recordings do you have? Today's trend is to use as much data as possible, both for training and for statistically significant accuracy estimation. We generally transcribe millions of utterances (more than 200,000 a month) from deployed ASR systems. See, for instance:

"Suendermann, D., Liscombe, J., Pieraccini, R., How to Drink from a Fire Hose: One Person Can Annoscribe 693 Thousand Utterances in One Month, SIGDIAL 2010, The 11th Annual SIGDIAL Meeting on Discourse and Dialogue, Sep 2010, Tokyo, Japan." (http://www.robertopieraccini.com/publications/2010/SIGDIAL48.pdf)

So, my suggestion is: transcribe them all. But if you do not have adequate resources for that, just choose a random sample of reasonably sized segments, more or less the size of a sentence (i.e. 5 to 10 words).

answered Nov 25 '10 at 07:12

Roberto Pieraccini

Roberto - We have about 170 hours of audio: 30 minutes of audio from each of roughly 340 subjects. The audio is recordings (from close-talking microphone for each subject) of undergraduates playing a cooperative computer game in teams of 3 to 5. Not everyone is talking at once, so it's only about 1600 words/hour, so perhaps 27,000 utterances total. According to your (very cool) paper one could with the right tools do that in about a day, but I suspect the transcriber we're hiring will disagree. :-) We are in fact going to do a complete transcription at low to moderate quality, but want to transcribe a sample more carefully to assess accuracy of both ASR and the initial transcription.

One possibility that occurred to me was to use the ASR system to segment the audio into segments it thinks have utterances vs. ones it thinks don't. Then use stratified sampling, strongly undersampling (but not ignoring - there's a fair bit of noise, so ASR won't be perfect on this) the segments it thinks don't have utterances.
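A minimal sketch of that stratified idea, assuming each segment already carries a stratum label from the ASR system's speech/non-speech decision. The function names, the stratum labels, and the sampling fractions are all made up for illustration. The estimator scales each stratum's sampled error and word counts by the inverse of its sampling fraction before forming the overall ratio.

```python
import random
from collections import defaultdict


def stratified_sample(segments, fractions, seed=0):
    """Draw a simple random sample within each stratum.

    segments:  list of (segment_id, stratum) pairs (labels are hypothetical,
               e.g. "speech" / "nonspeech" as decided by the ASR system).
    fractions: dict mapping stratum -> sampling fraction in (0, 1].
    Returns a dict mapping stratum -> list of sampled segment ids.
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for seg_id, stratum in segments:
        by_stratum[stratum].append(seg_id)
    sample = {}
    for stratum, ids in by_stratum.items():
        # Keep at least one segment per stratum so no stratum is ignored.
        k = max(1, round(fractions[stratum] * len(ids)))
        sample[stratum] = rng.sample(ids, k)
    return sample


def stratified_error_rate(stratum_totals, sampled_counts):
    """Estimate collection-wide (word errors / reference words).

    stratum_totals: dict stratum -> number of segments in the whole collection.
    sampled_counts: dict stratum -> list of (errors, ref_words), one pair per
                    sampled segment that was carefully transcribed.
    """
    est_errors = 0.0
    est_words = 0.0
    for stratum, counts in sampled_counts.items():
        # Inverse sampling fraction: each sampled segment stands in for
        # this many segments of the full collection.
        scale = stratum_totals[stratum] / len(counts)
        est_errors += scale * sum(e for e, _ in counts)
        est_words += scale * sum(w for _, w in counts)
    return est_errors / est_words
```

For example, `stratified_sample(segments, {"speech": 0.2, "nonspeech": 0.02})` would undersample the segments the recognizer thinks are empty by a factor of ten without dropping them entirely.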

(Nov 25 '10 at 08:52) Dave Lewis

Unfortunately this very interesting issue is not widely covered in ASR. Most people just calculate WER on a specific dataset that is believed to be representative of the task. For example, if your task is in-car audio recognition, you buy an in-car corpus and measure your accuracy on it.

There are a few papers on the subject, though; my recommendation is

Isabelle Guyon, John Makhoul, Richard Schwartz, and Vladimir Vapnik, "What Size Test Set Gives Good Error Rate Estimates?", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 20, No. 1, January 1998.
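For readers unfamiliar with the metric: WER is conventionally the word-level edit distance (substitutions, deletions, and insertions) between the ASR output and the reference transcript, divided by the number of reference words. A minimal sketch, assuming plain whitespace tokenization and no text normalization; `word_error_rate` is a name chosen here for illustration, not taken from any toolkit.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words consumed
    # so far (initially none) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, the value can exceed 1.0. Over a whole test set the usual practice is to sum errors and reference words across segments before dividing, rather than averaging per-segment rates.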

answered Oct 21 '10 at 13:29

Nickolay Shmyrev

edited Oct 21 '10 at 13:32

Thanks - a very useful paper. While it does address the issue of different speakers (or, in their case, writers), it unfortunately doesn't address the issue of what size segments to draw.

(Nov 01 '10 at 18:50) Dave Lewis

How have these choices typically been made in speech recognition research? The answer is with little thought beyond attempting to sample from different speakers.

What statistical assumptions have been made in computing accuracy estimates? The answer is that all samples are considered to be statistically independent.

Since you appear to be interested in estimating the performance only on your given collection, the problem doesn't seem too difficult, and the assumption of statistical independence of samples (given the collection) shouldn't be too daunting. The solution is simply to draw samples randomly (minding whatever factors you care to consider that play into statistical [in]dependence), with the number of samples determined by your desired level of accuracy according to the binomial distribution.
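As a rough illustration of sizing the sample from the binomial distribution, here is a sketch using the normal approximation to a binomial proportion. It assumes each trial (word or utterance) is an independent success/failure event, which is exactly the independence assumption discussed above and is only approximately true for speech; WER is also not strictly a proportion, since insertions can push it above 1. The function names and the default `z = 1.96` (about 95% confidence) are choices made for this example.

```python
import math


def sample_size_for_margin(p_guess: float, margin: float, z: float = 1.96) -> int:
    """Independent trials needed so that the normal-approximation confidence
    interval for a binomial proportion has half-width <= margin.

    p_guess: prior guess at the error rate (0.5 is the most conservative).
    margin:  desired half-width of the interval, e.g. 0.05.
    """
    if not 0.0 < p_guess < 1.0:
        raise ValueError("p_guess must be strictly between 0 and 1")
    if margin <= 0.0:
        raise ValueError("margin must be positive")
    return math.ceil(z * z * p_guess * (1.0 - p_guess) / (margin * margin))


def binomial_margin(p_hat: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation interval after n trials."""
    return z * math.sqrt(p_hat * (1.0 - p_hat) / n)
```

With no prior idea of the error rate, `sample_size_for_margin(0.5, 0.05)` gives the familiar figure of 385 trials. If errors within a speaker or recording are correlated, the effective number of independent trials is smaller than the raw word count, so this is better read as a lower bound.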

The more difficult problem is in estimating performance on one collection based on performance on a different (calibration) collection. Since there are so many factors that affect performance, some known, such as SNR, but many unknown, and all essentially uncalibrated, the best that can reasonably be hoped for is that the ranking of performance of different systems may not change [too much] from the calibration collection to the real collection.

answered Nov 24 '10 at 21:23

George Doddington

George, Thanks for dropping by. The big thing I was wondering about, however, was the length of each segment sampled from the audio files. Do you have a sense of the minimum number of seconds (or words, or utterances) that it's efficient to have someone transcribe?

(Nov 24 '10 at 21:31) Dave Lewis

Dave, I don't have anything to add regarding how to sample your data so that the sample is representative and its accuracy correlates with what you would obtain on the full dataset.

However, I want to point out that my company, Data Engines Corporation, has developed a series of algorithms that do allow you to carry out measurements of the average WER on the full dataset without any correct transcripts! As impossible or magical as this may sound, there are limitations to our technology.

In particular, you need at least five recognizers to carry out the unsupervised inference calculation. Check out our blog posts to learn more about it.

answered Dec 15 '10 at 17:55

dataengines


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.