
Speaker Adaptation of Hybrid NN/HMM Model for Speech Recognition Based on Singular Value Decomposition

Published in: Journal of Signal Processing Systems

Abstract

Recently, several speaker adaptation methods have been proposed for deep neural networks (DNNs) in many large vocabulary continuous speech recognition (LVCSR) tasks. However, only a few of these methods tune the connection weights of a trained DNN directly to optimize system performance, since doing so is very prone to over-fitting, especially when some class labels are missing from the adaptation data. In this paper, we propose a new speaker adaptation method for the hybrid NN/HMM speech recognition model based on singular value decomposition (SVD). We apply SVD to the weight matrices of a trained DNN and then tune the rectangular diagonal matrices of singular values with the adaptation data. Because only the singular values are modified, the weight matrices are updated only slightly, which alleviates the over-fitting problem. We evaluate the proposed adaptation method on two standard speech recognition tasks, namely TIMIT phone recognition and large vocabulary speech recognition on the Switchboard task. Experimental results show that the method is effective for adapting large DNN models using only a small amount of adaptation data. For example, recognition results on the Switchboard task show that the proposed SVD-based adaptation method can achieve up to a 3–6% relative error reduction using only a few dozen adaptation utterances per speaker.
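The core idea above can be sketched numerically: factor a trained layer's weight matrix as W = U Σ Vᵀ, freeze U and Vᵀ, and let only the singular values in Σ move during adaptation. The snippet below is a minimal NumPy illustration of this parameter-count argument, not the authors' implementation; the matrix sizes and the stand-in "update" applied to the singular values are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 256))  # a trained, speaker-independent weight matrix

# Thin SVD: W = U @ diag(s) @ Vt, with s holding min(512, 256) = 256 singular values.
U, s, Vt = np.linalg.svd(W, full_matrices=False)

# Adaptation would tune only s (256 values) instead of all 512 * 256 = 131072
# weights; here a small random perturbation stands in for the gradient updates
# computed on the adaptation data.
s_adapted = s * (1.0 + 0.01 * rng.standard_normal(s.shape))

# Reassemble the adapted layer: U and Vt are kept frozen.
W_adapted = U @ np.diag(s_adapted) @ Vt

print(s.size, "trainable values vs.", W.size, "full weights")
```

Since the number of tunable parameters drops from O(mn) to O(min(m, n)) per layer, even a few dozen utterances can constrain the update, which is the over-fitting argument made in the abstract.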



Acknowledgments

This work was partially supported by the National Natural Science Foundation of China (Grant No. 61273264) and the Electronic Information Industry Development Fund of China (Grant No. 2013-472).


Corresponding author

Correspondence to Lirong Dai.


Cite this article

Xue, S., Jiang, H., Dai, L. et al. Speaker Adaptation of Hybrid NN/HMM Model for Speech Recognition Based on Singular Value Decomposition. J Sign Process Syst 82, 175–185 (2016). https://doi.org/10.1007/s11265-015-1012-6
