Exploiting Large-Scale Teacher-Student Training for On-Device Acoustic Models

  • Conference paper
  • Text, Speech, and Dialogue (TSD 2021)

Abstract

We present results from Alexa speech teams on semi-supervised learning (SSL) of acoustic models (AM), with experiments spanning over 3,000 hours of GPU time, making our study one of the largest of its kind. We discuss SSL for AMs in a small-footprint setting, showing that a smaller-capacity model trained with 1 million hours of unsupervised data can outperform a baseline supervised system by a 14.3% word error rate reduction (WERR). When the supervised data is increased seven-fold, our gains diminish to 7.1% WERR; to improve SSL efficiency at larger supervised data regimes, we employ a step-wise distillation into a smaller model, obtaining a WERR of 14.4%. We then switch to SSL using larger student models in low-data regimes; in this setting, learning efficiency with unsupervised data is higher, and the student models may even outperform the teacher models. We develop a theoretical sketch to explain this behavior.
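The paper itself does not include code; as a rough illustration of the teacher-student (soft-target distillation) setup the abstract refers to, the sketch below trains a smaller student acoustic model to match the frame-level senone posteriors of a larger, frozen teacher on unlabeled features. All module names, dimensions, temperature, and hyperparameters are illustrative assumptions, not the authors' configuration.

```python
# Minimal, illustrative sketch of teacher-student (soft-target) training for an
# acoustic model. Model sizes, feature dimensions, and hyperparameters are
# assumptions for illustration, not the setup used in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT_DIM, NUM_SENONES, TEMP = 64, 3000, 2.0  # assumed values

class SmallAM(nn.Module):
    """A small LSTM acoustic model (stand-in for an on-device AM)."""
    def __init__(self, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(FEAT_DIM, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, NUM_SENONES)

    def forward(self, x):              # x: (batch, frames, FEAT_DIM)
        h, _ = self.lstm(x)
        return self.out(h)             # per-frame senone logits

# The teacher stands in for a larger, already-trained model (in practice its
# weights would be loaded); it is kept fixed via torch.no_grad() below.
teacher = SmallAM(hidden=768).eval()
student = SmallAM(hidden=256)          # smaller on-device student
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)

def distillation_step(unlabeled_feats):
    """One SSL step: match the student's frame posteriors to the teacher's."""
    with torch.no_grad():
        teacher_logp = F.log_softmax(teacher(unlabeled_feats) / TEMP, dim=-1)
    student_logp = F.log_softmax(student(unlabeled_feats) / TEMP, dim=-1)
    # KL(teacher || student); scaled by TEMP^2, as is customary, so gradient
    # magnitudes stay comparable across temperatures.
    loss = F.kl_div(student_logp, teacher_logp, log_target=True,
                    reduction="batchmean") * TEMP ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage on a random batch standing in for unlabeled audio features.
if __name__ == "__main__":
    feats = torch.randn(8, 100, FEAT_DIM)   # 8 utterances x 100 frames
    print(distillation_step(feats))
```

The same loop scales to the unsupervised regime the abstract describes simply by streaming unlabeled batches through `distillation_step`; the step-wise variant would repeat the procedure with an intermediate-sized student acting as the next teacher.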

References

  1. Amodei, D., Ananthanarayanan, S., et al.: Deep speech 2: end-to-end speech recognition in English and Mandarin. In: Procedings of ICML (2016)

    Google Scholar 

  2. Ba, J., Caruana, R.: Do deep nets really need to be deep? In: Advances in Neural Information Processing Systems, pp. 2654–2662 (2014)

    Google Scholar 

  3. Bégin, L., Germain, P., Laviolette, F., Roy, J.F.: PAC-Bayesian theory for transductive learning. In: Proceedings of of AISTATS (2014)

    Google Scholar 

  4. Chen, K., Huo, Q.: Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering. In: Proceedings of ICASSP (2016)

    Google Scholar 

  5. Chen, L., Leutnant, V.: Acoustic model bootstrapping using semi-supervised learning. In: Proceedings Interspeech 2019, pp. 3198–3202 (2019). https://doi.org/10.21437/Interspeech.2019-2818. http://dx.doi.org/10.21437/Interspeech.2019-2818

  6. Chen, Y., Wang, W., Wang, C.: Semi-supervised ASR by end-to-end self-training (2020)

    Google Scholar 

  7. Garimella, S., Mandal, A., Strom, N., Hoffmeister, B., Matsoukas, S., Parthasarathi, S.H.K.: Robust I-vector based adaptation of DNN acoustic model for speech recognition. In: Proceedings of Interspeech (2015)

    Google Scholar 

  8. Guo, C., Pleiss, G., Sun, Y., Weinberger, K.Q.: On calibration of modern neural networks. arXiv preprint arXiv:1706.04599 (2017)

  9. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)

  10. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)

    Article  Google Scholar 

  11. Huang, Y., Wang, Y., Gong, Y.: Semi-supervised training in deep learning acoustic model. In: Proceedings of Interspeech (2016)

    Google Scholar 

  12. Huang, Y., Yu, D., Gong, Y., Liu, C.: Semi-supervised GMM and DNN acoustic model training with multi-system combination and confidence re-calibration. In: Proceedings of Interspeech (2013)

    Google Scholar 

  13. Kahn, J., Lee, A., Hannun, A.: Self-training for end-to-end speech recognition. In: ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2020. https://doi.org/10.1109/icassp40776.2020.9054295. http://dx.doi.org/10.1109/ICASSP40776.2020.9054295

  14. Kemp, T., Waibel, A.: Unsupervised training of a speech recognizer: recent experiments. In: Proceedings of Eurospeech (1999)

    Google Scholar 

  15. Kingsbury, B.: Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling. In: 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3761–3764 (2009). https://doi.org/10.1109/ICASSP.2009.4960445

  16. Kurata, G., Audhkhasi, K.: Guiding CTC posterior spike timings for improved posterior fusion and knowledge distillation. CoRR abs/1904.08311 (2019). http://arxiv.org/abs/1904.08311

  17. Lamel, L., Gauvain, J.L., Adda, G.: Lightly supervised and unsupervised acoustic model training. Comput. Speech Lang. 16, 115–129 (2002)

    Article  Google Scholar 

  18. Li, J., Mohamed, A., Zweig, G., Gong, Y.: LSTM time and frequency recurrence for automatic speech recognition. In: 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 187–191 (2015). https://doi.org/10.1109/ASRU.2015.7404793

  19. Li, J., Zhao, R., Huang, J.T., Gong, Y.: Learning small-size DNN with output-distribution-based criteria. In: Proceedings of Interspeech (2014)

    Google Scholar 

  20. Li, J., et al.: High-accuracy and low-latency speech recognition with two-head contextual layer trajectory LSTM model. In: ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7699–7703 (2020). https://doi.org/10.1109/ICASSP40776.2020.9054387

  21. Ma, J., Matsoukas, S., Kimball, O., Schwartz, R.: Unsupervised training on large amounts of broadcast news data. In: Proceedings of ICASSP (2006)

    Google Scholar 

  22. Manohar, V., Ghahremani, P., Povey, D., Khudanpur, S.: A teacher-student learning approach for unsupervised domain adaptation of sequence-trained ASR models. In: 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 250–257 (2018). https://doi.org/10.1109/SLT.2018.8639635

  23. Mirzadeh, S.I., Farajtabar, M., Li, A., Ghasemzadeh, H.: Improved knowledge distillation via teacher assistant: bridging the gap between student and teacher. arXiv preprint arXiv:1902.03393 (2019)

  24. Moriya, T., et al.: Efficient building strategy with knowledge distillation for small-footprint acoustic models. In: 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 21–28 (2018). https://doi.org/10.1109/SLT.2018.8639545

  25. Munim, R.M., Inoue, N., Shinoda, K.: Sequence-level knowledge distillation for model compression of attention-based sequence-to-sequence speech recognition. In: ICASSP 2019–2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6151–6155. IEEE (2019)

    Google Scholar 

  26. Parthasarathi, S.H.K., Hoffmeister, B., Matsoukas, S., Mandal, A., Strom, N., Garimella, S.: fMLLR based feature-space speaker adaptation of DNN acoustic models. In: Proceedings of Interspeech (2015)

    Google Scholar 

  27. Parthasarathi, S.H.K., Sivakrishnan, N., Ladkat, P., Strom, N.: Realizing petabyte scale acoustic modeling. IEEE J. Emerg. Sel. Top. Circuits Syst. 9, 422-432 (2019)

    Google Scholar 

  28. Parthasarathi, S.H.K., Strom, N.: Lessons from building acoustic models from a million hours of speech. In: Proceedings of ICASSP (2019)

    Google Scholar 

  29. Pundak, G., Sainath, T.: Lower frame rate neural network acoustic models. In: Proceedings of Interspeech (2016)

    Google Scholar 

  30. Sak, H., Senior, A., Rao, K., Beaufays, F.: Fast and accurate recurrent neural network acoustic models for speech recognition. In: INTERSPEECH (2015)

    Google Scholar 

  31. Shi, W., Cao, J., Zhang, Q., Li, Y., Xu, L.: Edge computing: vision and challenges. IEEE Internet Things J. 3, 637–646 (2016)

    Article  Google Scholar 

  32. Siu, M.H., Gish, H., Richardson, F.: Improved estimation, evaluation and applications of confidence measures for speech recognition. In: Proceedings of European Conference on Speech Communication and Technology (1997)

    Google Scholar 

  33. Strom, N.: Scalable distributed DNN training using commodity GPU cloud computing. In: Proceedings of Interspeech (2015)

    Google Scholar 

  34. Watanabe, S., Hori, T., Le Roux, J., Hershey, J.R.: Student-teacher network learning with enhanced features. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5275–5279 (2017). https://doi.org/10.1109/ICASSP.2017.7953163

  35. Waters, A., Chebotar, Y.: Distilling knowledge from ensembles of neural networks for speech recognition. In: Interspeech (2016)

    Google Scholar 

  36. Weninger, F., Mana, F., Gemello, R., Andres-Ferrer, J., Zhan, P.: Semi-supervised learning with data augmentation for end-to-end ASR (2020)

    Google Scholar 

  37. Wong, J.H.M., Gales, M.: Sequence student-teacher training of deep neural networks. In: INTERSPEECH (2016)

    Google Scholar 

  38. Xie, Q., Dai, Z., Hovy, E., Luong, M.T., Le, Q.V.: Unsupervised data augmentation for consistency training (2020)

    Google Scholar 

  39. Yanzhang, H., et al.: Streaming end-to-end speech recognition for mobile devices. arXiv preprint arXiv:1811.06621 (2018)

  40. Yu, W., Liang, F., He, X., Hatcher, W.G., Lu, C., Lin, J., Yang, X.: A survey on the edge computing for the Internet of Things. IEEE Access 6, 6900–6919 (2017)

    Article  Google Scholar 

  41. Zhang, Y., et al.: Pushing the limits of semi-supervised learning for automatic speech recognition (2020)

    Google Scholar 

Acknowledgements

We would like to thank Minhua Wu, Jangwon Kim, Srinivas Parthasarathy, Kishore Nandury and Brian King for their helpful discussions.

Author information

Corresponding author

Correspondence to Jing Liu.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Liu, J., Swaminathan, R.V., Parthasarathi, S.H.K., Lyu, C., Mouchtaris, A., Kunzmann, S. (2021). Exploiting Large-Scale Teacher-Student Training for On-Device Acoustic Models. In: Ekštein, K., Pártl, F., Konopík, M. (eds) Text, Speech, and Dialogue. TSD 2021. Lecture Notes in Computer Science, vol. 12848. Springer, Cham. https://doi.org/10.1007/978-3-030-83527-9_35

  • DOI: https://doi.org/10.1007/978-3-030-83527-9_35

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-83526-2

  • Online ISBN: 978-3-030-83527-9

  • eBook Packages: Computer Science, Computer Science (R0)
