ABSTRACT
Transformer models have accelerated the field of speech recognition; a low word error rate (WER) is demonstrably achievable under varying conditions. However, most ASR engines are trained on acoustic and language models constructed from corpora that include news feeds, books, and blogs in order to demonstrate generalization, leading to errors when the model is applied to a specific domain. While the increase in WER is acute for highly specialized domains (health and medicine), our work shows that it is sizable even when the domain is general (hospitality). For such domains, a lightweight adaptation approach can help; lightweight because the adaptation does not require extensive post-hoc training of additional domain-specific acoustic or language models that act as adjuncts to the base ASR engine. We present our work on such a lightweight filtering pipeline that seamlessly integrates lightweight models (n-gram, decision trees) with powerful, pre-trained, bi-directional transformer models, all working in conjunction to derive a 1-best hypothesis word selection algorithm. Our pipeline reduces the WER by 1.6% to 2.5% absolute while treating the ASR engine as a black box, and without requiring additional complex discriminative training.
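The 1-best word selection idea summarized above can be sketched in miniature: each candidate replacement for a flagged ASR token receives a score from a lightweight model (e.g., an n-gram language model) and from a contextual bi-directional transformer, and the pipeline selects the candidate with the highest combined score. This is an illustrative assumption of how such a combination might look, not the paper's actual algorithm; all function names, weights, and scores below are hypothetical.

```python
def select_best(candidates, ngram_score, context_score, alpha=0.5):
    """Return the candidate maximizing a weighted combination of a
    lightweight n-gram score and a contextual (transformer) score.

    alpha controls the interpolation between the two scorers; in a real
    pipeline it would be tuned on held-out domain data.
    """
    def combined(word):
        return alpha * ngram_score(word) + (1 - alpha) * context_score(word)
    return max(candidates, key=combined)

# Toy hospitality-domain example: the ASR engine emitted "suit" in
# "I'd like to book a ___ room"; candidate homophones are rescored.
ngram = {"suite": 0.7, "suit": 0.2, "sweet": 0.1}   # hypothetical n-gram scores
ctx = {"suite": 0.8, "suit": 0.1, "sweet": 0.1}     # hypothetical transformer scores

best = select_best(ngram.keys(), ngram.get, ctx.get)
print(best)  # → suite
```

Treating the scorers as opaque callables mirrors the black-box framing of the abstract: the selection step needs only scores, not access to the ASR engine's internals.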