Abstract
The conventional recipe for Automatic Speech Recognition (ASR) models is to 1) train multiple checkpoints on a training set, relying on a validation set to prevent overfitting via early stopping, and 2) average the last several checkpoints, or those with the lowest validation losses, to obtain the final model. In this paper, we rethink and update early stopping and checkpoint averaging from the perspective of the bias-variance tradeoff. In theory, bias and variance reflect a model's fit and variability, respectively, and their tradeoff determines the overall generalization error; evaluating them precisely, however, is impractical. As an alternative, we take the training loss and validation loss as proxies of bias and variance and guide early stopping and checkpoint averaging using their tradeoff, namely an Approximated Bias-Variance Tradeoff (ApproBiVT). When evaluated with advanced ASR models, our recipe provides 2.5%–3.7% and 3.1%–4.6% CER reduction on AISHELL-1 and AISHELL-2, respectively. (The code and sampled unaugmented training sets used in this paper will be publicly available on GitHub.)
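The checkpoint-averaging side of this idea can be illustrated with a minimal sketch. The paper's exact criterion is not reproduced here; this sketch assumes the tradeoff is summarized as the sum of training and validation loss (a hypothetical proxy for the total generalization error), and all function names below are illustrative rather than taken from the paper.

```python
def approbivt_score(train_loss, valid_loss):
    """Proxy for generalization error: bias ~ training loss, variance ~ validation loss.
    Assumed form for illustration only; the paper may combine the two differently."""
    return train_loss + valid_loss

def select_checkpoints(checkpoints, k):
    """Rank checkpoints by the approximated tradeoff and keep the k best.
    checkpoints: list of (train_loss, valid_loss, params) tuples."""
    ranked = sorted(checkpoints, key=lambda c: approbivt_score(c[0], c[1]))
    return [c[2] for c in ranked[:k]]

def average_params(param_dicts):
    """Uniformly average parameter dictionaries (name -> value),
    as in standard checkpoint averaging."""
    n = len(param_dicts)
    return {key: sum(p[key] for p in param_dicts) / n for key in param_dicts[0]}

# Toy run: three checkpoints with scalar "weights" standing in for model tensors.
ckpts = [
    (0.30, 0.50, {"w": 1.0}),  # score 0.80
    (0.20, 0.45, {"w": 2.0}),  # score 0.65
    (0.10, 0.60, {"w": 3.0}),  # score 0.70
]
final = average_params(select_checkpoints(ckpts, k=2))  # averages w=2.0 and w=3.0
```

The same score could also drive early stopping, e.g. by halting once the tradeoff proxy has not improved for a fixed number of epochs, mirroring conventional validation-loss patience.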
Supported by the National Innovation 2030 Major S&T Project of China under Grant 2020AAA0104202 and the Basic Research of the Academy of Broadcasting Science, NRTA, under Grant JBKY20230180.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Wang, F., Hao, M., Shi, Y., Xu, B. (2024). Lead ASR Models to Generalize Better Using Approximated Bias-Variance Tradeoff. In: Luo, B., Cheng, L., Wu, ZG., Li, H., Li, C. (eds) Neural Information Processing. ICONIP 2023. Communications in Computer and Information Science, vol 1961. Springer, Singapore. https://doi.org/10.1007/978-981-99-8126-7_14
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8125-0
Online ISBN: 978-981-99-8126-7