ABSTRACT
Transformer models have accelerated the field of speech recognition; a low word error rate (WER) is demonstrably achievable under varying conditions. However, most ASR engines are trained on acoustic and language models constructed from corpora that include news feeds, books, and blogs in order to demonstrate generalization, leading to errors when the model is applied to a specific domain. While the increase in WER is acute for highly specialized domains (health and medicine), our work shows that it is sizable even when the domain is general (hospitality). For such domains, a lightweight adaptation approach can help; lightweight because the adaptation does not require extensive post-hoc training of additional domain-specific acoustic or language models that act as adjuncts to the base ASR engine. We present our work on such a lightweight filtering pipeline that seamlessly integrates lightweight models (n-gram, decision trees) with powerful, pre-trained, bi-directional transformer models, all working in conjunction to derive a 1-best hypothesis word selection algorithm. Our pipeline reduces the WER by 1.6% to 2.5% absolute while treating the ASR engine as a black box, and without requiring additional complex discriminative training.
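The 1-best word selection idea summarized above can be sketched in miniature: each candidate replacement for a flagged ASR token receives a score from a lightweight model (e.g., an n-gram language model) and from a contextual bi-directional transformer, and the pipeline selects the candidate with the highest combined score. This is an illustrative assumption of how such a combination might look, not the paper's actual algorithm; all function names, weights, and scores below are hypothetical.

```python
def select_best(candidates, ngram_score, context_score, alpha=0.5):
    """Return the candidate maximizing a weighted combination of a
    lightweight n-gram score and a contextual (transformer) score.

    alpha controls the interpolation between the two scorers; in a real
    pipeline it would be tuned on held-out domain data.
    """
    def combined(word):
        return alpha * ngram_score(word) + (1 - alpha) * context_score(word)
    return max(candidates, key=combined)

# Toy hospitality-domain example: the ASR engine emitted "suit" in
# "I'd like to book a ___ room"; candidate homophones are rescored.
ngram = {"suite": 0.7, "suit": 0.2, "sweet": 0.1}   # hypothetical n-gram scores
ctx = {"suite": 0.8, "suit": 0.1, "sweet": 0.1}     # hypothetical transformer scores

best = select_best(ngram.keys(), ngram.get, ctx.get)
print(best)  # → suite
```

Treating the scorers as opaque callables mirrors the black-box framing of the abstract: the selection step needs only scores, not access to the ASR engine's internals.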