Abstract
Deep neural networks (DNNs) have been successfully applied to many pattern classification problems, including acoustic modelling for automatic speech recognition (ASR). However, DNN adaptation remains a challenging task. Many approaches have been proposed in recent years to improve the adaptability of DNNs for robust ASR. This chapter reviews recent adaptation methods for DNNs, broadly categorising them into constrained adaptation, feature normalisation, feature augmentation and structured DNN parameterisation. Specifically, we describe various methods of estimating reliable representations for feature augmentation, focusing primarily on a comparison between i-vectors and bottleneck features. We also present an adaptable DNN layer parameterisation scheme based on a linear interpolation structure, whose interpolation weights can be reliably adjusted to adapt the DNN to different conditions. This generic scheme subsumes many existing DNN adaptation methods, including speaker-code adaptation, learning hidden unit contributions (LHUC), factorised hidden layers and cluster adaptive training for DNNs.
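To make the linear-interpolation scheme concrete, the following is a minimal PyTorch sketch, not the chapter's implementation: the layer name, dimensions, basis count and the mean-squared-error adaptation objective are illustrative assumptions. The idea it illustrates is that a hidden layer's weight matrix is expressed as a condition-dependent interpolation of shared basis matrices, W(s) = Σ_k λ_k(s) W_k, and that adaptation updates only the low-dimensional interpolation weights λ(s) while the shared bases stay frozen.

```python
# Hedged sketch of an interpolated hidden layer:  W(s) = sum_k lambda_k(s) * W_k.
# Only the interpolation weights lambda(s) are updated during adaptation.
import torch
import torch.nn as nn


class InterpolatedHiddenLayer(nn.Module):
    """Hypothetical layer illustrating the interpolation-based parameterisation."""

    def __init__(self, in_dim: int, out_dim: int, num_bases: int):
        super().__init__()
        # K basis weight matrices, shared across all speakers/conditions.
        self.bases = nn.Parameter(torch.randn(num_bases, out_dim, in_dim) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_dim))

    def forward(self, x: torch.Tensor, lam: torch.Tensor) -> torch.Tensor:
        # lam: (K,) speaker/condition-dependent interpolation weights.
        w = torch.einsum("k,koi->oi", lam, self.bases)  # W(s) = sum_k lam_k * W_k
        return torch.sigmoid(x @ w.t() + self.bias)


# Adaptation: freeze the shared bases and estimate only lam for a new condition
# from a small amount of adaptation data (random placeholders used here).
layer = InterpolatedHiddenLayer(in_dim=440, out_dim=1024, num_bases=4)
lam = nn.Parameter(torch.full((4,), 0.25))       # start from a uniform mixture
opt = torch.optim.SGD([lam], lr=0.1)             # bases are excluded from the optimiser

x = torch.randn(32, 440)                         # adaptation frames (placeholder)
target = torch.randn(32, 1024)                   # stand-in for the real supervision signal
for _ in range(10):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(layer(x, lam), target)
    loss.backward()
    opt.step()
```

With a single basis and per-unit scaling the same structure reduces to LHUC-style adaptation, and with multiple full-rank bases it resembles cluster adaptive training for DNNs, which is how the generic scheme subsumes those methods.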
Notes
1. A speaker super-vector is the concatenation of the mean vectors of a Gaussian mixture model that represents the feature distribution of a given speaker.
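As an illustration only (the shapes and the least-squares step below are assumptions, not the estimation procedure used in the chapter), this numpy sketch builds such a super-vector and shows how the i-vector model explains it as a low-rank shift of a speaker-independent super-vector, M(s) ≈ m + T·w(s), with w(s) being the low-dimensional i-vector.

```python
# Toy sketch of the super-vector idea: stack the C component means (dimension D
# each) of a speaker-adapted GMM into a single C*D vector, then relate it to a
# low-dimensional i-vector via the total-variability model M(s) ~= m + T @ w(s).
import numpy as np

C, D, R = 512, 39, 100                     # GMM components, feature dim, i-vector dim (illustrative)
speaker_means = np.random.randn(C, D)      # per-speaker adapted component means (placeholder)
supervector = speaker_means.reshape(-1)    # speaker super-vector of length C*D

m = np.random.randn(C * D)                 # speaker-independent (UBM) super-vector (placeholder)
T = np.random.randn(C * D, R)              # total-variability matrix (placeholder)
# Crude least-squares stand-in for the i-vector; real systems use a MAP estimate
# over Baum-Welch statistics rather than the super-vector directly.
w = np.linalg.lstsq(T, supervector - m, rcond=None)[0]
```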
Copyright information
© 2017 Springer International Publishing AG
About this chapter
Cite this chapter
Sim, K.C., Qian, Y., Mantena, G., Samarakoon, L., Kundu, S., Tan, T. (2017). Adaptation of Deep Neural Network Acoustic Models for Robust Automatic Speech Recognition. In: Watanabe, S., Delcroix, M., Metze, F., Hershey, J. (eds) New Era for Robust Speech Recognition. Springer, Cham. https://doi.org/10.1007/978-3-319-64680-0_9
DOI: https://doi.org/10.1007/978-3-319-64680-0_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-64679-4
Online ISBN: 978-3-319-64680-0
eBook Packages: Computer Science, Computer Science (R0)