Abstract
This paper investigates how deep bottleneck neural networks can be used to combine the benefits of both i-vectors and speaker-adaptive feature transformations. We show how a GMM-based speech recognizer can be greatly improved by applying a feature-space maximum likelihood linear regression (fMLLR) transformation to the outputs of a deep bottleneck neural network trained on a concatenation of regular Mel filterbank features and speaker i-vectors. Adding the i-vectors reduces the word error rate of the GMM system by 3–7% compared to an identical system without i-vectors. We also examine deep neural network (DNN) systems trained on various combinations of i-vectors, fMLLR-transformed bottleneck features, and other feature-space transformations. The best approach results in speaker-adapted DNNs that show a 15–19% relative improvement over a strong speaker-independent DNN baseline.
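The input construction described in the abstract, a per-utterance speaker i-vector appended to every per-frame filterbank feature vector before the bottleneck network, can be sketched as follows. This is an illustrative NumPy sketch, not the authors' code; the function name and dimensions are assumptions for the example.

```python
import numpy as np

def append_ivector(frames: np.ndarray, ivector: np.ndarray) -> np.ndarray:
    """Concatenate a fixed per-utterance i-vector to every feature frame.

    frames:  (T, F) array of Mel filterbank features, one row per frame
    ivector: (D,) speaker i-vector estimated once for the whole utterance
    returns: (T, F + D) array that would serve as the network input
    """
    # Repeat the single i-vector so each of the T frames gets a copy,
    # then append it along the feature axis.
    tiled = np.tile(ivector, (frames.shape[0], 1))
    return np.concatenate([frames, tiled], axis=1)

# Toy example: 100 frames of 40-dim filterbanks plus a 100-dim i-vector.
feats = np.random.randn(100, 40)
ivec = np.random.randn(100)
x = append_ivector(feats, ivec)
print(x.shape)  # (100, 140)
```

Because the i-vector is constant over the utterance, the network sees the same speaker code at every frame, which is what lets the bottleneck layer learn a speaker-normalized representation.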
Copyright information
© 2017 Springer International Publishing AG
Cite this paper
Nguyen, T.S., Kilgour, K., Sperber, M., Waibel, A. (2017). Improved Speaker Adaptation by Combining I-vector and fMLLR with Deep Bottleneck Networks. In: Karpov, A., Potapova, R., Mporas, I. (eds) Speech and Computer. SPECOM 2017. Lecture Notes in Computer Science(), vol 10458. Springer, Cham. https://doi.org/10.1007/978-3-319-66429-3_41
Print ISBN: 978-3-319-66428-6
Online ISBN: 978-3-319-66429-3