
Accurate Semi-supervised Automatic Speech Recognition via Multi-hypotheses-Based Curriculum Learning

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2024)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14649)


Abstract

How can we accurately transcribe speech signals into text when only a portion of them is annotated? ASR (Automatic Speech Recognition) systems are extensively used in real-world applications including automatic translation systems and transcription services. Due to the exponential growth of unannotated speech data and the significant cost of manual labeling, semi-supervised ASR approaches have garnered attention. Such scenarios include transcribing videos on streaming platforms, where a vast amount of content is uploaded daily but only a fraction of it is transcribed manually. Previous approaches for semi-supervised ASR use a pseudo-labeling scheme to incorporate unlabeled examples during training. However, their effectiveness is limited because they do not account for the uncertainty of the pseudo labels when using them as targets for unlabeled examples. In this paper, we propose MOCA, an accurate framework for semi-supervised ASR. MOCA generates multiple hypotheses for each speech instance to account for the uncertainty of its pseudo label. Furthermore, MOCA considers the varying degrees of uncertainty in pseudo labels across speech instances, enabling robust training on the uncertain dataset. Extensive experiments on real-world speech datasets show that MOCA successfully improves the transcription performance of previous ASR models.

J. Kim and K. H. Park—These authors contributed equally to this work.
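
The abstract's two key ideas, multi-hypothesis pseudo labels and uncertainty-aware curriculum ordering, can be sketched in a few lines of code. The Python below is a minimal illustration under stated assumptions, not the authors' implementation: `decode_n_best` is a hypothetical stand-in for any beam-search ASR decoder, and the disagreement score (one minus the mean pairwise similarity of the N-best list) is an assumed proxy for pseudo-label uncertainty.

```python
# Hypothetical sketch of multi-hypothesis uncertainty scoring and curriculum
# ordering in the spirit of MOCA; not the paper's actual algorithm.
from difflib import SequenceMatcher
from itertools import combinations
from typing import Callable, List, Tuple

def hypothesis_disagreement(hypotheses: List[str]) -> float:
    """Uncertainty proxy: 1 minus the mean pairwise similarity of the
    N-best hypotheses. Identical hypotheses give 0.0 (confident pseudo
    label); divergent hypotheses approach 1.0 (uncertain)."""
    if len(hypotheses) < 2:
        return 0.0
    sims = [SequenceMatcher(None, a, b).ratio()
            for a, b in combinations(hypotheses, 2)]
    return 1.0 - sum(sims) / len(sims)

def curriculum_order(
    utterances: List[str],                      # ids/paths of unlabeled speech
    decode_n_best: Callable[[str], List[str]],  # hypothetical N-best ASR decoder
) -> List[Tuple[str, List[str], float]]:
    """Score each unlabeled utterance by pseudo-label uncertainty and
    return (utterance, hypotheses, uncertainty) sorted easiest-first."""
    scored = []
    for utt in utterances:
        hyps = decode_n_best(utt)
        scored.append((utt, hyps, hypothesis_disagreement(hyps)))
    scored.sort(key=lambda t: t[2])  # confident pseudo labels first
    return scored

if __name__ == "__main__":
    # Dummy decoder standing in for a real beam search, for illustration only.
    fake_nbest = {
        "utt1.wav": ["hello world", "hello world", "hello world"],
        "utt2.wav": ["hello word", "yellow world", "hollow ward"],
    }
    for utt, hyps, unc in curriculum_order(list(fake_nbest), fake_nbest.get):
        print(f"{utt}: uncertainty={unc:.3f}")
    # utt1.wav (agreeing hypotheses) is scheduled before utt2.wav.
```

Sorting easiest-first mirrors the curriculum intuition in the abstract: training first consumes utterances whose hypotheses agree, and only gradually admits those whose pseudo labels are uncertain.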



Acknowledgement

This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) [No. 2022-0-00641, XVoice: Multi-Modal Voice Meta Learning], [No. 2021-0-01343, Artificial Intelligence Graduate School Program (Seoul National University)], and [No. 2021-0-02068, Artificial Intelligence Innovation Hub (Artificial Intelligence Institute, Seoul National University)].

Author information


Corresponding author

Correspondence to U Kang.



Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Kim, J., Park, K.H., Kang, U. (2024). Accurate Semi-supervised Automatic Speech Recognition via Multi-hypotheses-Based Curriculum Learning. In: Yang, DN., Xie, X., Tseng, V.S., Pei, J., Huang, JW., Lin, J.CW. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2024. Lecture Notes in Computer Science, vol 14649. Springer, Singapore. https://doi.org/10.1007/978-981-97-2262-4_4

Download citation

  • DOI: https://doi.org/10.1007/978-981-97-2262-4_4


  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-2264-8

  • Online ISBN: 978-981-97-2262-4

  • eBook Packages: Computer Science, Computer Science (R0)
