Abstract
The conventional recipe for Automatic Speech Recognition (ASR) models is to 1) train multiple checkpoints on a training set, relying on a validation set to prevent overfitting via early stopping, and 2) average the last several checkpoints, or those with the lowest validation losses, to obtain the final model. In this paper, we rethink and update early stopping and checkpoint averaging from the perspective of the bias-variance tradeoff. In theory, bias and variance reflect a model's fit and variability, respectively, and their tradeoff determines the overall generalization error; evaluating them precisely, however, is impractical. As an alternative, we take the training loss and validation loss as proxies of bias and variance and guide early stopping and checkpoint averaging using their tradeoff, namely an Approximated Bias-Variance Tradeoff (ApproBiVT). When evaluated with advanced ASR models, our recipe provides 2.5%–3.7% and 3.1%–4.6% CER reduction on AISHELL-1 and AISHELL-2, respectively. (The code and sampled unaugmented training sets used in this paper will be publicly available on GitHub.)
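The checkpoint-averaging side of this idea can be illustrated with a minimal sketch. The paper's exact criterion is not reproduced here; this sketch assumes the tradeoff is summarized as the sum of training and validation loss (a hypothetical proxy for the total generalization error), and all function names below are illustrative rather than taken from the paper.

```python
def approbivt_score(train_loss, valid_loss):
    """Proxy for generalization error: bias ~ training loss, variance ~ validation loss.
    Assumed form for illustration only; the paper may combine the two differently."""
    return train_loss + valid_loss

def select_checkpoints(checkpoints, k):
    """Rank checkpoints by the approximated tradeoff and keep the k best.
    checkpoints: list of (train_loss, valid_loss, params) tuples."""
    ranked = sorted(checkpoints, key=lambda c: approbivt_score(c[0], c[1]))
    return [c[2] for c in ranked[:k]]

def average_params(param_dicts):
    """Uniformly average parameter dictionaries (name -> value),
    as in standard checkpoint averaging."""
    n = len(param_dicts)
    return {key: sum(p[key] for p in param_dicts) / n for key in param_dicts[0]}

# Toy run: three checkpoints with scalar "weights" standing in for model tensors.
ckpts = [
    (0.30, 0.50, {"w": 1.0}),  # score 0.80
    (0.20, 0.45, {"w": 2.0}),  # score 0.65
    (0.10, 0.60, {"w": 3.0}),  # score 0.70
]
final = average_params(select_checkpoints(ckpts, k=2))  # averages w=2.0 and w=3.0
```

The same score could also drive early stopping, e.g. by halting once the tradeoff proxy has not improved for a fixed number of epochs, mirroring conventional validation-loss patience.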
Supported by the National Innovation 2030 Major S&T Project of China under Grant 2020AAA0104202 and the Basic Research of the Academy of Broadcasting Science, NRTA, under Grant JBKY20230180.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Wang, F., Hao, M., Shi, Y., Xu, B. (2024). Lead ASR Models to Generalize Better Using Approximated Bias-Variance Tradeoff. In: Luo, B., Cheng, L., Wu, ZG., Li, H., Li, C. (eds) Neural Information Processing. ICONIP 2023. Communications in Computer and Information Science, vol 1961. Springer, Singapore. https://doi.org/10.1007/978-981-99-8126-7_14
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8125-0
Online ISBN: 978-981-99-8126-7