Comparison of Automatic Speech Recognition Systems

Kim, Joshua Y.; Liu, Chunfeng; Calvo, Rafael A.; McCabe, Kathryn; Taylor, Silas C. R.; Schuller, Björn W.; Wu, Kaihang

doi:10.1007/978-981-19-5538-9_8

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 943)

Abstract

High-quality transcription systems are required for conversational analysis systems. We compared two manual transcribers with five automatic transcription systems using video conferences from a medical domain and found that (1) manual transcriptions significantly outperformed the automatic services, and (2) the automatic transcription of YouTube Captions significantly outperformed the other ASR services.
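The abstract does not name the metric behind "outperformed". A common choice for comparing transcripts is word error rate (WER): the word-level edit distance between a reference transcript and a hypothesis transcript, divided by the number of reference words. The following is a minimal sketch under that assumption, not the authors' evaluation code; the function name and the example sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: WER is the comparison metric (the abstract does not say).
    Tokenisation here is plain lowercase whitespace splitting.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: a manual transcript as reference, an ASR output as hypothesis.
manual = "the patient has a fever"
automatic = "the patient has fever"
print(word_error_rate(manual, automatic))
```

A lower WER means the hypothesis is closer to the reference. Because WER divides by the reference length, it can exceed 1.0 when the hypothesis contains many inserted words.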

Acknowledgements

The authors thank Hicham Moad S for his help with scripting for the Microsoft Azure API, and Marriane Makahiya for typesetting. RAC is partially funded by the Australian Research Council Future Fellowship FT140100824.

Author information

Authors and Affiliations

The University of Sydney, Sydney, Australia
Joshua Y. Kim & Kaihang Wu
Hello Sunday Morning, Surry Hills, Australia
Chunfeng Liu
Imperial College London, London, United Kingdom
Rafael A. Calvo & Björn W. Schuller
University of California, Los Angeles, CA, USA
Kathryn McCabe
University of New South Wales, Kensington, NSW, Australia
Silas C. R. Taylor

Corresponding author

Correspondence to Rafael A. Calvo.

Editor information

Editors and Affiliations

Toshiba (United Kingdom), Weybridge, UK
Svetlana Stoyanchev
Daimler (Germany), Stuttgart, Germany
Stefan Ultes
The Chinese University of Hong Kong, Shenzhen, China
Haizhou Li

Cite this paper

Kim, J.Y. et al. (2022). Comparison of Automatic Speech Recognition Systems. In: Stoyanchev, S., Ultes, S., Li, H. (eds) Conversational AI for Natural Human-Centric Interaction. Lecture Notes in Electrical Engineering, vol 943. Springer, Singapore. https://doi.org/10.1007/978-981-19-5538-9_8

DOI: https://doi.org/10.1007/978-981-19-5538-9_8
Published: 01 November 2022
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-5537-2
Online ISBN: 978-981-19-5538-9
eBook Packages: Computer Science, Computer Science (R0)
