In this paper, we present techniques to compute a confidence score for the predictions made by an end-to-end speech recognition model. Our proposed neural confidence measure (NCM) is trained as a binary classifier to accept or reject an end-to-end speech recognition result. We incorporate features from the encoder, the decoder, and the attention block of the attention-based end-to-end speech recognition model to improve the NCM significantly. We observe that using information from multiple beams further improves performance. As a case study of this NCM, we consider an application of the utterance-level confidence score in a distributed speech recognition environment with two or more speech recognition systems running on platforms with different resource capabilities. We show that around 57% of the computation on a resource-rich high-end platform (e.g., a cloud platform) can be saved without sacrificing accuracy compared to the high-end-only solution. Around 70–80% of the computation can be saved if we allow the word error rate to degrade by 5–10% relative to the high-end solution.
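The distributed scenario described above amounts to a simple confidence-gated fallback: the on-device recognizer runs first, and the cloud recognizer is invoked only when the utterance-level NCM score falls below an accept/reject threshold. A minimal sketch of that dispatch logic is shown below; the function names (`run_on_device`, `run_on_cloud`) and the threshold value are illustrative placeholders, not part of the paper's implementation.

```python
# Confidence-gated distributed ASR sketch (illustrative names throughout).
# The on-device model returns a hypothesis plus an utterance-level NCM
# score; only low-confidence utterances are re-decoded on the cloud.

def run_on_device(audio):
    # Stand-in for the on-device end-to-end ASR + NCM.
    # Returns (transcript, confidence in [0, 1]).
    return ("hello world", 0.92)

def run_on_cloud(audio):
    # Stand-in for the resource-rich server-side recognizer.
    return "hello world"

THRESHOLD = 0.5  # hypothetical accept/reject operating point of the NCM

def recognize(audio, device_fn=run_on_device, cloud_fn=run_on_cloud,
              threshold=THRESHOLD):
    """Return (transcript, used_cloud). The cloud recognizer is called
    only when the on-device confidence falls below the threshold."""
    text, confidence = device_fn(audio)
    if confidence >= threshold:
        return text, False           # accept the on-device result
    return cloud_fn(audio), True     # fall back to the cloud recognizer
```

Raising the threshold trades cloud computation for accuracy, which is how the paper's 57% (no accuracy loss) versus 70–80% (5–10% relative WER degradation) operating points arise.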
Cite as: Kumar, A., Singh, S., Gowda, D., Garg, A., Singh, S., Kim, C. (2020) Utterance Confidence Measure for End-to-End Speech Recognition with Applications to Distributed Speech Recognition Scenarios. Proc. Interspeech 2020, 4357-4361, doi: 10.21437/Interspeech.2020-3216
@inproceedings{kumar20f_interspeech, author={Ankur Kumar and Sachin Singh and Dhananjaya Gowda and Abhinav Garg and Shatrughan Singh and Chanwoo Kim}, title={{Utterance Confidence Measure for End-to-End Speech Recognition with Applications to Distributed Speech Recognition Scenarios}}, year=2020, booktitle={Proc. Interspeech 2020}, pages={4357--4361}, doi={10.21437/Interspeech.2020-3216} }