Abstract
How can we accurately transcribe speech signals into text when only a portion of them are annotated? ASR (Automatic Speech Recognition) systems are widely used in real-world applications such as automatic translation systems and transcription services. Due to the rapid growth of unannotated speech data and the significant cost of manual labeling, semi-supervised ASR approaches have garnered attention. Such scenarios include transcribing videos on streaming platforms, where a vast amount of content is uploaded daily but only a fraction of it is transcribed manually. Previous approaches to semi-supervised ASR use a pseudo-labeling scheme to incorporate unlabeled examples during training. However, their effectiveness is limited because they ignore the uncertainty of the pseudo labels when using them as targets for unlabeled instances. In this paper, we propose MOCA, an accurate framework for semi-supervised ASR. MOCA generates multiple hypotheses for each speech instance to account for the uncertainty of the pseudo label. Furthermore, MOCA considers the varying degrees of uncertainty in pseudo labels across speech instances, enabling robust training on the uncertain dataset. Extensive experiments on real-world speech datasets show that MOCA successfully improves the transcription performance of previous ASR models.
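The abstract describes two ideas: weighting several pseudo-label hypotheses per utterance, and down-weighting utterances whose pseudo labels are uncertain. The sketch below is an illustrative rendering of that general recipe, not the authors' implementation: the beam-search log-scores, the softmax posterior over hypotheses, and the entropy-based confidence weighting are all assumptions introduced here for demonstration.

```python
# Illustrative sketch of uncertainty-aware multi-hypothesis pseudo-labeling.
# All specifics (scores, weighting rule) are assumptions, not MOCA's actual design.
import math

def hypothesis_weights(log_scores):
    """Softmax over per-hypothesis log-scores -> posterior over hypotheses."""
    m = max(log_scores)
    exps = [math.exp(s - m) for s in log_scores]
    z = sum(exps)
    return [e / z for e in exps]

def uncertainty(weights):
    """Normalized entropy of the hypothesis posterior, in [0, 1]."""
    if len(weights) < 2:
        return 0.0
    h = -sum(w * math.log(w) for w in weights if w > 0)
    return h / math.log(len(weights))

def pseudo_label_loss(per_hypothesis_losses, log_scores):
    """Expected loss over hypotheses, down-weighted for uncertain utterances."""
    w = hypothesis_weights(log_scores)
    expected = sum(wi * li for wi, li in zip(w, per_hypothesis_losses))
    confidence = 1.0 - uncertainty(w)  # confident utterances contribute more
    return confidence * expected

# Toy example: three beam-search hypotheses for one unlabeled utterance.
losses = [0.2, 0.9, 1.1]     # model loss against each hypothesis transcript
scores = [-1.0, -2.5, -3.0]  # assumed beam-search log-scores
print(pseudo_label_loss(losses, scores))
```

When all hypotheses score equally, the posterior entropy is maximal and the utterance contributes nothing, which mirrors the curriculum intuition of deferring highly uncertain examples.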
J. Kim and K. H. Park—These authors contributed equally to this work.
Acknowledgement
This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) [No. 2022-0-00641, XVoice: Multi-Modal Voice Meta Learning], [No. 2021-0-01343, Artificial Intelligence Graduate School Program (Seoul National University)], and [No. 2021-0-02068, Artificial Intelligence Innovation Hub (Artificial Intelligence Institute, Seoul National University)].
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Kim, J., Park, K.H., Kang, U. (2024). Accurate Semi-supervised Automatic Speech Recognition via Multi-hypotheses-Based Curriculum Learning. In: Yang, D.N., Xie, X., Tseng, V.S., Pei, J., Huang, J.W., Lin, J.C.W. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2024. Lecture Notes in Computer Science, vol 14649. Springer, Singapore. https://doi.org/10.1007/978-981-97-2262-4_4
DOI: https://doi.org/10.1007/978-981-97-2262-4_4
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-2264-8
Online ISBN: 978-981-97-2262-4