AugMixSpeech: A Data Augmentation Method and Consistency Regularization for Mandarin Automatic Speech Recognition

  • Conference paper
Natural Language Processing and Chinese Computing (NLPCC 2024)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 15361)


Abstract

Automatic speech recognition (ASR) is a crucial technology in artificial intelligence and is widely applied in modern society. Deep learning-based ASR methods offer a simpler training framework and higher recognition rates than traditional methods, but they require large amounts of training data to perform well, and insufficient data can lead to model overfitting. To overcome these problems, we propose a novel data augmentation framework called AugMixSpeech, which generates more natural and diverse data by randomly sampling different augmentation techniques and mixing the augmented data. In addition, to ensure that the model maintains stable predictions when faced with these data, we introduce a consistency regularization method that includes global consistency and local consistency. The constraints imposed by this method enable the model to better learn the intrinsic features of the data. Extensive experiments on the AISHELL-1 validation and test sets achieve character error rates of 4.23% and 4.79%, respectively, which outperforms existing approaches and demonstrates the method's effectiveness for automatic speech recognition.
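The abstract describes the two mechanisms only at a high level: (1) sampling several augmentation chains and convexly mixing their outputs with the clean input, in the spirit of AugMix, and (2) penalizing divergence between the model's predictions on the clean and augmented views. The sketch below illustrates one plausible reading of this for log-mel spectrogram inputs, in PyTorch. Everything here is an assumption for illustration: the function names, the SpecAugment-style masking operations, the Dirichlet/Beta mixing hyper-parameters, and the use of a symmetrized KL divergence as the consistency term are not taken from the paper itself.

```python
# Hypothetical sketch of AugMix-style spectrogram mixing plus a
# symmetrized-KL consistency loss. NOT the authors' implementation;
# shapes, ops, and hyper-parameters are illustrative assumptions.
import random
import torch
import torch.nn.functional as F

def time_mask(spec: torch.Tensor, max_width: int = 30) -> torch.Tensor:
    """Zero out a random span of time frames (SpecAugment-style). spec: (T, F)."""
    spec = spec.clone()
    width = random.randint(1, max_width)
    start = random.randint(0, max(0, spec.size(0) - width))
    spec[start:start + width, :] = 0.0
    return spec

def freq_mask(spec: torch.Tensor, max_width: int = 8) -> torch.Tensor:
    """Zero out a random band of mel channels."""
    spec = spec.clone()
    width = random.randint(1, max_width)
    start = random.randint(0, max(0, spec.size(1) - width))
    spec[:, start:start + width] = 0.0
    return spec

AUGMENTATIONS = [time_mask, freq_mask]  # assumed operation set

def augment_and_mix(spec: torch.Tensor, k: int = 3, depth: int = 2,
                    alpha: float = 1.0) -> torch.Tensor:
    """Sample k random augmentation chains and convexly mix their
    outputs with the clean spectrogram (AugMix-style)."""
    chain_weights = torch.distributions.Dirichlet(
        torch.full((k,), alpha)).sample()
    mix_weight = torch.distributions.Beta(alpha, alpha).sample()
    mixed = torch.zeros_like(spec)
    for i in range(k):
        aug = spec
        for _ in range(random.randint(1, depth)):
            aug = random.choice(AUGMENTATIONS)(aug)
        mixed = mixed + chain_weights[i] * aug
    return mix_weight * spec + (1.0 - mix_weight) * mixed

def symmetric_kl(p_logits: torch.Tensor, q_logits: torch.Tensor) -> torch.Tensor:
    """Symmetrized KL divergence between two output distributions,
    usable as a consistency penalty between clean and augmented views."""
    p_log = F.log_softmax(p_logits, dim=-1)
    q_log = F.log_softmax(q_logits, dim=-1)
    kl_pq = F.kl_div(q_log, p_log, log_target=True, reduction="batchmean")
    kl_qp = F.kl_div(p_log, q_log, log_target=True, reduction="batchmean")
    return 0.5 * (kl_pq + kl_qp)
```

In training, the total objective would presumably combine the usual ASR loss on both views with a weighted consistency term, e.g. loss = asr_loss + lambda_c * symmetric_kl(logits_clean, logits_aug). Applying the divergence frame by frame would give a local term, while applying it to utterance-level (pooled) representations would give a global term; how the paper balances the two is not stated in the abstract.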



Acknowledgements

This work is supported by the National Natural Science Foundation of China (62276116), the Six Talent Peaks Project in Jiangsu Province (DZXX-122), and the Jiangsu Graduate Research Innovation Program (KYCX23_3677).

Author information


Corresponding author

Correspondence to Jun Chen.



Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Jiang, Y. et al. (2025). AugMixSpeech: A Data Augmentation Method and Consistency Regularization for Mandarin Automatic Speech Recognition. In: Wong, D.F., Wei, Z., Yang, M. (eds) Natural Language Processing and Chinese Computing. NLPCC 2024. Lecture Notes in Computer Science, vol 15361. Springer, Singapore. https://doi.org/10.1007/978-981-97-9437-9_12


  • DOI: https://doi.org/10.1007/978-981-97-9437-9_12

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-9436-2

  • Online ISBN: 978-981-97-9437-9

  • eBook Packages: Computer Science (R0)
