Abstract
Automatic speech recognition (ASR) technologies can provide significant efficiency gains in the health sector by saving time and financial resources, allowing specialists to shift more time to high-value activities.
Creating customized ASR models requires domain- and task-related transcribed speech data. Unfortunately, producing such data is usually too expensive for medical institutions: it requires substantial financial resources, human effort, and expertise. Consequently, this paper explores a semi-supervised medical domain adaptation method for the Latvian language that benefits from untranscribed speech recordings. As the initial model, we use the currently available general-purpose hybrid ASR system, whose acoustic model is trained with the lattice-free maximum mutual information method. The initial system is applied to the domain-related untranscribed data to extract sequences of pseudo-labels. These automatic transcriptions are then added to the supervised data and used together to update the acoustic model. To improve our ASR system further, we have also updated its language model with additional in-domain texts.
We have achieved significant improvements in the quality of speech recognition on all evaluation datasets. On the epicrises, psychiatry, and radiology datasets, word error rate (WER) decreased by 39%, 27%–29%, and 21%, respectively.
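The self-training pipeline described above can be summarized as: decode the untranscribed in-domain audio with the seed system, keep the resulting pseudo-labels, and pool them with the manually transcribed data before retraining the acoustic model. The sketch below illustrates this data-pooling step; the `transcribe` function, the identifier names, and the confidence-threshold filter are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of the pseudo-labeling (self-training) data preparation.
# `transcribe` stands in for decoding one recording with the seed
# general-purpose ASR system; it returns a hypothesis and a confidence score.

def transcribe(audio_id):
    # Placeholder for real decoding; a practical system would return
    # the 1-best hypothesis and a lattice- or posterior-based confidence.
    return f"pseudo transcript for {audio_id}", 0.9

def build_training_set(supervised, untranscribed, min_conf=0.8):
    """Combine manually transcribed pairs with confident pseudo-labels."""
    combined = list(supervised)  # keep every supervised (audio, text) pair
    for audio_id in untranscribed:
        hyp, conf = transcribe(audio_id)
        if conf >= min_conf:     # drop low-confidence automatic transcripts
            combined.append((audio_id, hyp))
    return combined

supervised = [("rec001", "manual transcript")]
untranscribed = ["rec002", "rec003"]
training_set = build_training_set(supervised, untranscribed)
```

The combined `training_set` would then be used to update the acoustic model, after which the language model is separately updated with in-domain texts.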
Acknowledgements
This research has been supported by the ICT Competence Centre (www.itkc.lv) within the project “2.8. Automated voice communication solutions for the healthcare industry” of EU Structural funds, ID no 1.2.1.1/18/A/003.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Salimbajevs, A., Kapočiūtė-Dzikienė, J. (2022). Automatic Speech Recognition Model Adaptation to Medical Domain Using Untranscribed Audio. In: Ivanovic, M., Kirikova, M., Niedrite, L. (eds) Digital Business and Intelligent Systems. Baltic DB&IS 2022. Communications in Computer and Information Science, vol 1598. Springer, Cham. https://doi.org/10.1007/978-3-031-09850-5_5
DOI: https://doi.org/10.1007/978-3-031-09850-5_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-09849-9
Online ISBN: 978-3-031-09850-5
eBook Packages: Computer Science (R0)