
Speaker Adaptation of Hybrid NN/HMM Model for Speech Recognition Based on Singular Value Decomposition

Published in: Journal of Signal Processing Systems

Abstract

Recently, several speaker adaptation methods have been proposed for deep neural networks (DNNs) in many large vocabulary continuous speech recognition (LVCSR) tasks. However, only a few of these methods tune the connection weights of a trained DNN directly to optimize system performance, since doing so is very prone to over-fitting, especially when some class labels are missing from the adaptation data. In this paper, we propose a new speaker adaptation method for the hybrid NN/HMM speech recognition model based on singular value decomposition (SVD). We apply SVD to the weight matrices of a trained DNN and then tune the rectangular diagonal matrices of singular values with the adaptation data. Because only the singular values are modified, the weight matrices are updated only slightly, which alleviates the over-fitting problem. We evaluate the proposed adaptation method on two standard speech recognition tasks, namely TIMIT phone recognition and large vocabulary speech recognition on the Switchboard task. Experimental results show that the method is effective for adapting large DNN models using only a small amount of adaptation data. For example, recognition results on the Switchboard task show that the proposed SVD-based adaptation method can achieve up to a 3–6% relative error reduction using only a few dozen adaptation utterances per speaker.
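The core idea above can be sketched numerically: factor a trained layer's weight matrix as W = U Σ Vᵀ, freeze U and Vᵀ, and let only the singular values in Σ move during adaptation. The snippet below is a minimal NumPy illustration of this parameter-count argument, not the authors' implementation; the matrix sizes and the stand-in "update" applied to the singular values are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 256))  # a trained, speaker-independent weight matrix

# Thin SVD: W = U @ diag(s) @ Vt, with s holding min(512, 256) = 256 singular values.
U, s, Vt = np.linalg.svd(W, full_matrices=False)

# Adaptation would tune only s (256 values) instead of all 512 * 256 = 131072
# weights; here a small random perturbation stands in for the gradient updates
# computed on the adaptation data.
s_adapted = s * (1.0 + 0.01 * rng.standard_normal(s.shape))

# Reassemble the adapted layer: U and Vt are kept frozen.
W_adapted = U @ np.diag(s_adapted) @ Vt

print(s.size, "trainable values vs.", W.size, "full weights")
```

Since the number of tunable parameters drops from O(mn) to O(min(m, n)) per layer, even a few dozen utterances can constrain the update, which is the over-fitting argument made in the abstract.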



Acknowledgments

This work was partially supported by the National Natural Science Foundation of China (Grant No. 61273264) and the Electronic Information Industry Development Fund of China (Grant No. 2013-472).


Corresponding author

Correspondence to Lirong Dai.


Cite this article

Xue, S., Jiang, H., Dai, L. et al. Speaker Adaptation of Hybrid NN/HMM Model for Speech Recognition Based on Singular Value Decomposition. J Sign Process Syst 82, 175–185 (2016). https://doi.org/10.1007/s11265-015-1012-6
