Abstract
Automatic speech recognition (ASR) is a crucial technology in artificial intelligence and is widely applied in modern society. Deep learning-based ASR methods offer a simpler training framework and higher recognition rates than traditional methods, but they require large amounts of training data to perform well, and insufficient data can lead to model overfitting. To overcome these problems, we propose a novel data augmentation framework called AugMixSpeech, which generates more natural and diverse data by randomly sampling different augmentation techniques and mixing the augmented results. In addition, to ensure that the model makes stable predictions on such data, we introduce a consistency regularization method comprising global consistency and local consistency; the constraints it imposes help the model better learn the intrinsic features of the data. Extensive experiments on the Aishell-1 validation and test sets achieve character error rates of 4.23% and 4.79%, respectively, outperforming existing approaches and demonstrating the method's effectiveness for automatic speech recognition.
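To make the mechanism concrete, the sketch below shows one way the two ideas described in the abstract could fit together in PyTorch: an AugMix-style operator that mixes several randomly composed spectrogram augmentation chains back into the clean input, and a symmetrized KL divergence that penalizes disagreement between the model's predictions on clean and augmented views. This is a minimal illustration under our own assumptions, not the paper's implementation; the names (augmix_speech, symmetric_kl, time_mask, freq_mask) and all hyperparameters are hypothetical, and the paper's global/local consistency split is collapsed into a single consistency term here.

```python
# Hypothetical sketch of AugMix-style mixing and a symmetric-KL consistency
# loss for log-mel spectrograms. Names and hyperparameters are illustrative,
# not taken from the AugMixSpeech paper.
import random
import torch
import torch.nn.functional as F

def time_mask(spec, max_width=30):
    """Zero out a random span of time frames (SpecAugment-style)."""
    spec = spec.clone()
    t = spec.size(-1)
    w = random.randint(1, max_width)
    start = random.randint(0, max(0, t - w))
    spec[..., start:start + w] = 0.0
    return spec

def freq_mask(spec, max_width=8):
    """Zero out a random band of mel bins (SpecAugment-style)."""
    spec = spec.clone()
    f = spec.size(-2)
    w = random.randint(1, max_width)
    start = random.randint(0, max(0, f - w))
    spec[..., start:start + w, :] = 0.0
    return spec

AUGMENTATIONS = [time_mask, freq_mask]

def augmix_speech(spec, width=3, depth=2, alpha=1.0):
    """Compose `width` random augmentation chains and mix them back into
    the clean spectrogram, following the AugMix recipe of Hendrycks et al."""
    mix_weights = torch.distributions.Dirichlet(
        torch.full((width,), alpha)).sample()
    m = torch.distributions.Beta(alpha, alpha).sample()
    mixed = torch.zeros_like(spec)
    for i in range(width):
        chain = spec
        for _ in range(random.randint(1, depth)):
            chain = random.choice(AUGMENTATIONS)(chain)
        mixed = mixed + mix_weights[i] * chain
    return m * spec + (1.0 - m) * mixed

def symmetric_kl(p_logits, q_logits):
    """Symmetrized KL divergence between two output distributions,
    standing in for the generic consistency term (cf. Johnson and
    Sinanovic on symmetrizing the Kullback-Leibler distance)."""
    p = F.log_softmax(p_logits, dim=-1)
    q = F.log_softmax(q_logits, dim=-1)
    return 0.5 * (F.kl_div(q, p, log_target=True, reduction="batchmean")
                  + F.kl_div(p, q, log_target=True, reduction="batchmean"))
```

In training, a consistency term of this kind would typically be added to the usual ASR objective, e.g. loss = asr_loss + lam * symmetric_kl(model(spec), model(augmix_speech(spec))), where lam is a weighting hyperparameter.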
References
Kheddar, H., Hemis, M., Himeur, Y.: Automatic speech recognition using advanced deep learning approaches: a survey. Inf. Fusion 102422 (2024)
Prabhavalkar, R., Hori, T., Sainath, T.N., Schlüter, R., Watanabe, S.: End-to-end speech recognition: a survey. IEEE/ACM Trans. Audio Speech Lang. Process. (2023)
Baevski, A., Zhou, Y., Mohamed, A., Auli, M.: wav2vec 2.0: a framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 33, 12449–12460 (2020)
Ding, K., Li, R., Xu, Y., Du, X., Deng, B.: Adaptive data augmentation for Mandarin automatic speech recognition. Appl. Intell. 54(7), 5674–5687 (2024)
Ko, T., Peddinti, V., Povey, D., Khudanpur, S.: Audio augmentation for speech recognition. In: Interspeech 2015, p. 3586 (2015)
Wang, Y., Getreuer, P., Hughes, T., Lyon, R.F., Saurous, R.A.: Trainable frontend for robust and far-field keyword spotting. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5670–5674 (2017)
Park, D.S., et al.: SpecAugment: a simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779 (2019)
Wu, D., et al.: U2++: unified two-pass bidirectional end-to-end model for speech recognition. arXiv preprint arXiv:2106.05642 (2021)
Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: beyond empirical risk minimization. In: International Conference on Learning Representations (2018)
Meng, L., Xu, J., Tan, X., Wang, J., Qin, T., Xu, B.: MixSpeech: data augmentation for low-resource automatic speech recognition. In: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7008–7012. IEEE (2021)
Yun, S., Han, D., Oh, S.J., Chun, S., Choe, J., Yoo, Y.: CutMix: regularization strategy to train strong classifiers with localizable features. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2019)
Hendrycks, D., Mu, N., Cubuk, E.D., Zoph, B., Gilmer, J., Lakshminarayanan, B.: AugMix: a simple method to improve robustness and uncertainty under data shift. In: International Conference on Learning Representations (2020)
Kim, J., Choo, W., Jeong, H., Song, H.O.: Co-Mixup: saliency guided joint mixup with supermodular diversity. arXiv preprint arXiv:2102.03065 (2021)
Ng, D., et al.: Contrastive speech mixup for low-resource keyword spotting. In: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE (2023)
Qiu, S.: Construction of English speech recognition model by fusing CNN and random deep factorization TDNN. ACM Trans. Asian Low-Res. Lang. Inf. Process. (2023)
Zhang, N., Wang, J., Wei, W., Qu, X., Cheng, N., Xiao, J.: Cacnet: cube attentional CNN for automatic speech recognition. In: 2021 International Joint Conference on Neural Networks (IJCNN), pp. 1–7. IEEE (2021)
Oruh, J., Viriri, S., Adegun, A.: Long short-term memory recurrent neural network for automatic speech recognition. IEEE Access 10, 30069–30079 (2022)
Fang, Y., Li, X.: Unimodal aggregation for CTC-based speech recognition. In: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 10591–10595 (2024)
Lei, Z., et al.: Personalization of CTC-based end-to-end speech recognition using pronunciation-driven subword tokenization. In: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 10096–10100 (2024)
Gong, X., Wang, W., Shao, H., Chen, X., Qian, Y.: Factorized AED: factorized attention-based encoder-decoder for text-only domain adaptive ASR. In: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5 (2023)
Fan, R., Chu, W., Chang, P., Alwan, A.: A CTC alignment-based non-autoregressive transformer for end-to-end automatic speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 31, 1436–1448 (2023)
Lyu, B., Fan, C., Ming, Y., Zhao, P., Hu, N.: EN-HACN: enhancing hybrid architecture with fast attention and capsule network for end-to-end speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 31, 1050–1062 (2023)
Chang, F.J., Radfar, M., Mouchtaris, A., King, B., Kunzmann, S.: End-to-end multi-channel transformer for speech recognition. In: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5884–5888 (2021)
Gulati, A., et al.: Conformer: convolution-augmented transformer for speech recognition. In: Conference of the International Speech Communication Association, pp. 5036–5040 (2020)
Burchi, M., Vielzeuf, V.: Efficient conformer: progressive downsampling and grouped attention for automatic speech recognition. In: 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 8–15 (2021)
Kim, S., et al.: Squeezeformer: an efficient transformer for automatic speech recognition. In: Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A. (eds.) Advances in Neural Information Processing Systems, vol. 35, pp. 9361–9373. Curran Associates, Inc. (2022)
Kang, W.H., Alam, J., Fathan, A.: L-mix: a latent-level instance mixup regularization for robust self-supervised speaker representation learning. IEEE J. Sel. Topics Signal Process. 16(6), 1263–1272 (2022)
Johnson, D.H., Sinanovic, S.: Symmetrizing the Kullback-Leibler distance. IEEE Trans. Inf. Theory 1(1), 1–10 (2001)
Bu, H., Du, J., Na, X., Wu, B., Zheng, H.: Aishell-1: an open-source Mandarin speech corpus and a speech recognition baseline. In: 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), pp. 1–5. IEEE (2017)
Gao, Z., Zhang, S., McLoughlin, I., Yan, Z.: Paraformer: fast and accurate parallel transformer for non-autoregressive end-to-end speech recognition. In: Conference of the International Speech Communication Association, pp. 2063–2067 (2022)
Lai, Z.H., et al.: InterFormer: interactive local and global features fusion for automatic speech recognition. In: Proceedings of INTERSPEECH 2023, pp. 566–570 (2023)
Liang, C., et al.: Fast-U2++: fast and accurate end-to-end speech recognition in joint CTC/attention frames. In: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5 (2023)
Wang, J., Liang, Z., Zhang, X., Cheng, N., Xiao, J.: EfficientASR: speech recognition network compression via attention redundancy and chunk-level FFN optimization. arXiv preprint arXiv:2404.19214 (2024)
Li, J., Duan, Z., Li, S., Yu, X., Yang, G.: ESAformer: enhanced self-attention for automatic speech recognition. IEEE Signal Process. Lett. 31, 471–475 (2024)
Gao, G., et al.: Information extraction and noisy feature pruning for Mandarin speech recognition. J. Audio Eng. Soc. 72(1/2), 59–70 (2024)
Wang, F., Xu, B., Xu, B.: SSCFormer: push the limit of chunk-wise conformer for streaming ASR using sequentially sampled chunks and chunked causal convolution. IEEE Signal Process. Lett. 31, 421–425 (2024)
Acknowledgements
This work is supported by the National Natural Science Foundation of China (62276116), the Six Talent Peaks Project in Jiangsu Province (DZXX-122), and the Jiangsu Graduate Research Innovation Program (KYCX23_3677).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Jiang, Y., et al. (2025). AugMixSpeech: A Data Augmentation Method and Consistency Regularization for Mandarin Automatic Speech Recognition. In: Wong, D.F., Wei, Z., Yang, M. (eds.) Natural Language Processing and Chinese Computing. NLPCC 2024. Lecture Notes in Computer Science, vol. 15361. Springer, Singapore. https://doi.org/10.1007/978-981-97-9437-9_12
DOI: https://doi.org/10.1007/978-981-97-9437-9_12
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-9436-2
Online ISBN: 978-981-97-9437-9