Abstract
We present results from Alexa speech teams on semi-supervised learning (SSL) of acoustic models (AMs), with experiments spanning over 3,000 hours of GPU time, making our study one of the largest of its kind. We discuss SSL for AMs in a small-footprint setting, showing that a smaller-capacity model trained with 1 million hours of unsupervised data can outperform a baseline supervised system by a 14.3% word error rate reduction (WERR). When the supervised data is increased seven-fold, our gains diminish to 7.1% WERR; to improve SSL efficiency in larger supervised-data regimes, we employ a step-wise distillation into a smaller model, obtaining a WERR of 14.4%. We then study SSL with larger student models in low-data regimes; not only is learning efficiency with unsupervised data higher, but the student models may also outperform their teacher models in such a setting. We develop a theoretical sketch to explain this behavior.
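The teacher-student training underlying these experiments can be illustrated with a minimal sketch: the teacher's temperature-softened frame-level posteriors serve as soft pseudo-labels for unlabeled audio, and the student minimizes the cross-entropy against them. This is a generic illustration of the distillation objective, not the authors' exact recipe; the function names and the temperature value are illustrative assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy between teacher soft targets and student posteriors.

    The teacher's softened posteriors act as frame-level pseudo-labels,
    so no human transcription is needed -- the core idea of
    teacher-student semi-supervised AM training.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(p * math.log(q) for p, q in zip(p_teacher, p_student))

# When the student matches the teacher exactly, the loss reaches its
# minimum (the entropy of the shared distribution); any mismatch is larger.
logits = [2.0, 0.5, -1.0]
loss_min = distillation_loss(logits, logits)
loss_off = distillation_loss(logits, [0.0, 2.0, -1.0])
```

In practice this loss is averaged over frames and senone classes; the step-wise variant described in the abstract inserts an intermediate-capacity model between teacher and student and applies the same objective at each stage.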
Acknowledgements
We would like to thank Minhua Wu, Jangwon Kim, Srinivas Parthasarathy, Kishore Nandury and Brian King for their helpful discussions.
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Liu, J., Swaminathan, R.V., Parthasarathi, S.H.K., Lyu, C., Mouchtaris, A., Kunzmann, S. (2021). Exploiting Large-Scale Teacher-Student Training for On-Device Acoustic Models. In: Ekštein, K., Pártl, F., Konopík, M. (eds) Text, Speech, and Dialogue. TSD 2021. Lecture Notes in Computer Science(), vol 12848. Springer, Cham. https://doi.org/10.1007/978-3-030-83527-9_35
Print ISBN: 978-3-030-83526-2
Online ISBN: 978-3-030-83527-9