Abstract
Automatic speech recognition (ASR) is a crucial technology in artificial intelligence and is widely applied in modern society. Deep learning-based ASR methods offer a simpler training framework and higher recognition rates than traditional methods, but they require large amounts of training data to perform well, and insufficient data can lead to model overfitting. To overcome these problems, we propose a novel data augmentation framework called AugMixSpeech, which generates more natural and diverse data by randomly sampling different augmentation techniques and mixing the augmented results. In addition, to ensure that the model makes stable predictions on such data, we introduce a consistency regularization method comprising global consistency and local consistency; the constraints it imposes help the model better learn the intrinsic features of the data. Extensive experiments on the Aishell-1 validation and test sets achieve character error rates of 4.23% and 4.79%, respectively, outperforming existing approaches and demonstrating the method's effectiveness for automatic speech recognition.
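To make the mechanism concrete, the sketch below shows one way the two ideas described in the abstract could fit together in PyTorch: an AugMix-style operator that mixes several randomly composed spectrogram augmentation chains back into the clean input, and a symmetrized KL divergence that penalizes disagreement between the model's predictions on clean and augmented views. This is a minimal illustration under our own assumptions, not the paper's implementation; the names (augmix_speech, symmetric_kl, time_mask, freq_mask) and all hyperparameters are hypothetical, and the paper's global/local consistency split is collapsed into a single consistency term here.

```python
# Hypothetical sketch of AugMix-style mixing and a symmetric-KL consistency
# loss for log-mel spectrograms. Names and hyperparameters are illustrative,
# not taken from the AugMixSpeech paper.
import random
import torch
import torch.nn.functional as F

def time_mask(spec, max_width=30):
    """Zero out a random span of time frames (SpecAugment-style)."""
    spec = spec.clone()
    t = spec.size(-1)
    w = random.randint(1, max_width)
    start = random.randint(0, max(0, t - w))
    spec[..., start:start + w] = 0.0
    return spec

def freq_mask(spec, max_width=8):
    """Zero out a random band of mel bins (SpecAugment-style)."""
    spec = spec.clone()
    f = spec.size(-2)
    w = random.randint(1, max_width)
    start = random.randint(0, max(0, f - w))
    spec[..., start:start + w, :] = 0.0
    return spec

AUGMENTATIONS = [time_mask, freq_mask]

def augmix_speech(spec, width=3, depth=2, alpha=1.0):
    """Compose `width` random augmentation chains and mix them back into
    the clean spectrogram, following the AugMix recipe of Hendrycks et al."""
    mix_weights = torch.distributions.Dirichlet(
        torch.full((width,), alpha)).sample()
    m = torch.distributions.Beta(alpha, alpha).sample()
    mixed = torch.zeros_like(spec)
    for i in range(width):
        chain = spec
        for _ in range(random.randint(1, depth)):
            chain = random.choice(AUGMENTATIONS)(chain)
        mixed = mixed + mix_weights[i] * chain
    return m * spec + (1.0 - m) * mixed

def symmetric_kl(p_logits, q_logits):
    """Symmetrized KL divergence between two output distributions,
    standing in for the generic consistency term (cf. Johnson and
    Sinanovic on symmetrizing the Kullback-Leibler distance)."""
    p = F.log_softmax(p_logits, dim=-1)
    q = F.log_softmax(q_logits, dim=-1)
    return 0.5 * (F.kl_div(q, p, log_target=True, reduction="batchmean")
                  + F.kl_div(p, q, log_target=True, reduction="batchmean"))
```

In training, a consistency term of this kind would typically be added to the usual ASR objective, e.g. loss = asr_loss + lam * symmetric_kl(model(spec), model(augmix_speech(spec))), where lam is a weighting hyperparameter.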
References
Kheddar, H., Hemis, M., Himeur, Y.: Automatic speech recognition using advanced deep learning approaches: a survey. Inf. Fusion 102422 (2024)
Prabhavalkar, R., Hori, T., Sainath, T.N., Schlüter, R., Watanabe, S.: End-to-end speech recognition: a survey. IEEE/ACM Trans. Audio Speech Lang. Process. (2023)
Baevski, A., Zhou, Y., Mohamed, A., Auli, M.: wav2vec 2.0: a framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 33, 12449–12460 (2020)
Ding, K., Li, R., Xu, Y., Du, X., Deng, B.: Adaptive data augmentation for Mandarin automatic speech recognition. Appl. Intell. 54(7), 5674–5687 (2024)
Ko, T., Peddinti, V., Povey, D., Khudanpur, S.: Audio augmentation for speech recognition. In: Interspeech 2015, p. 3586 (2015)
Wang, Y., Getreuer, P., Hughes, T., Lyon, R.F., Saurous, R.A.: Trainable frontend for robust and far-field keyword spotting. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5670–5674 (2017)
Park, D.S., et al.: SpecAugment: a simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779 (2019)
Wu, D., et al.: U2++: unified two-pass bidirectional end-to-end model for speech recognition. arXiv preprint arXiv:2106.05642 (2021)
Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: beyond empirical risk minimization. In: International Conference on Learning Representations (2018)
Meng, L., Xu, J., Tan, X., Wang, J., Qin, T., Xu, B.: MixSpeech: data augmentation for low-resource automatic speech recognition. In: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7008–7012. IEEE (2021)
Yun, S., Han, D., Oh, S.J., Chun, S., Choe, J., Yoo, Y.: CutMix: regularization strategy to train strong classifiers with localizable features. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2019)
Hendrycks, D., Mu, N., Cubuk, E.D., Zoph, B., Gilmer, J., Lakshminarayanan, B.: AugMix: a simple method to improve robustness and uncertainty under data shift. In: International Conference on Learning Representations (2020)
Kim, J., Choo, W., Jeong, H., Song, H.O.: Co-Mixup: saliency guided joint mixup with supermodular diversity. arXiv preprint arXiv:2102.03065 (2021)
Ng, D., et al.: Contrastive speech mixup for low-resource keyword spotting. In: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE (2023)
Qiu, S.: Construction of English speech recognition model by fusing CNN and random deep factorization TDNN. ACM Trans. Asian Low-Res. Lang. Inf. Process. (2023)
Zhang, N., Wang, J., Wei, W., Qu, X., Cheng, N., Xiao, J.: Cacnet: cube attentional CNN for automatic speech recognition. In: 2021 International Joint Conference on Neural Networks (IJCNN), pp. 1–7. IEEE (2021)
Oruh, J., Viriri, S., Adegun, A.: Long short-term memory recurrent neural network for automatic speech recognition. IEEE Access 10, 30069–30079 (2022)
Fang, Y., Li, X.: Unimodal aggregation for CTC-based speech recognition. In: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 10591–10595 (2024)
Lei, Z., et al.: Personalization of CTC-based end-to-end speech recognition using pronunciation-driven subword tokenization. In: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 10096–10100 (2024)
Gong, X., Wang, W., Shao, H., Chen, X., Qian, Y.: Factorized AED: factorized attention-based encoder-decoder for text-only domain adaptive ASR. In: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5 (2023)
Fan, R., Chu, W., Chang, P., Alwan, A.: A CTC alignment-based non-autoregressive transformer for end-to-end automatic speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 31, 1436–1448 (2023)
Lyu, B., Fan, C., Ming, Y., Zhao, P., Hu, N.: EN-HACN: enhancing hybrid architecture with fast attention and capsule network for end-to-end speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 31, 1050–1062 (2023)
Chang, F.J., Radfar, M., Mouchtaris, A., King, B., Kunzmann, S.: End-to-end multi-channel transformer for speech recognition. In: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5884–5888 (2021)
Gulati, A., et al.: Conformer: convolution-augmented transformer for speech recognition. In: Conference of the International Speech Communication Association, pp. 5036–5040 (2020)
Burchi, M., Vielzeuf, V.: Efficient conformer: progressive downsampling and grouped attention for automatic speech recognition. In: 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 8–15 (2021)
Kim, S., et al.: Squeezeformer: an efficient transformer for automatic speech recognition. In: Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A. (eds.) Advances in Neural Information Processing Systems, vol. 35, pp. 9361–9373. Curran Associates, Inc. (2022)
Kang, W.H., Alam, J., Fathan, A.: L-mix: a latent-level instance mixup regularization for robust self-supervised speaker representation learning. IEEE J. Sel. Topics Signal Process. 16(6), 1263–1272 (2022)
Johnson, D.H., Sinanovic, S.: Symmetrizing the Kullback-Leibler distance. IEEE Trans. Inf. Theory 1(1), 1–10 (2001)
Bu, H., Du, J., Na, X., Wu, B., Zheng, H.: Aishell-1: an open-source Mandarin speech corpus and a speech recognition baseline. In: 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), pp. 1–5. IEEE (2017)
Gao, Z., Zhang, S., McLoughlin, I., Yan, Z.: Paraformer: fast and accurate parallel transformer for non-autoregressive end-to-end speech recognition. In: Conference of the International Speech Communication Association, pp. 2063–2067 (2022)
Lai, Z.H., et al.: InterFormer: interactive local and global features fusion for automatic speech recognition. In: Proceedings of INTERSPEECH 2023, pp. 566–570 (2023)
Liang, C., et al.: Fast-U2++: fast and accurate end-to-end speech recognition in joint CTC/attention frames. In: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5 (2023)
Wang, J., Liang, Z., Zhang, X., Cheng, N., Xiao, J.: EfficientASR: speech recognition network compression via attention redundancy and chunk-level FFN optimization. arXiv preprint arXiv:2404.19214 (2024)
Li, J., Duan, Z., Li, S., Yu, X., Yang, G.: ESAformer: enhanced self-attention for automatic speech recognition. IEEE Signal Process. Lett. 31, 471–475 (2024)
Gao, G., et al.: Information extraction and noisy feature pruning for Mandarin speech recognition. J. Audio Eng. Soc. 72(1/2), 59–70 (2024)
Wang, F., Xu, B., Xu, B.: SSCFormer: push the limit of chunk-wise conformer for streaming ASR using sequentially sampled chunks and chunked causal convolution. IEEE Signal Process. Lett. 31, 421–425 (2024)
Acknowledgements
This work is supported by the National Natural Science Foundation of China (62276116), the Six Talent Peaks Project in Jiangsu Province (DZXX-122), and the Jiangsu Graduate Research Innovation Program (KYCX23_3677).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Jiang, Y., et al. (2025). AugMixSpeech: A Data Augmentation Method and Consistency Regularization for Mandarin Automatic Speech Recognition. In: Wong, D.F., Wei, Z., Yang, M. (eds.) Natural Language Processing and Chinese Computing. NLPCC 2024. Lecture Notes in Computer Science, vol. 15361. Springer, Singapore. https://doi.org/10.1007/978-981-97-9437-9_12
DOI: https://doi.org/10.1007/978-981-97-9437-9_12
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-9436-2
Online ISBN: 978-981-97-9437-9