Offline to online speaker adaptation for real-time deep neural network based LVCSR systems

Long, Yanhua; Li, Yijie; Zhang, Bo

doi:10.1007/s11042-018-6041-2

Published: 27 April 2018

Multimedia Tools and Applications, Volume 77, pages 28101–28119 (2018)

Abstract

In this study, we investigate an offline-to-online strategy for speaker adaptation of automatic speech recognition systems. These systems are trained with traditional feed-forward deep neural networks and with the recently proposed lattice-free maximum mutual information (MMI) time-delay deep neural networks. In this strategy, the test speaker's identity is modeled as an iVector that is estimated offline and then used online during speech decoding. To ensure the quality of the iVectors, we introduce a speaker enrollment stage that collects enough reliable speech to estimate an accurate and stable offline iVector. We also review and investigate different iVector estimation techniques for speaker adaptation in large vocabulary continuous speech recognition (LVCSR) tasks. Experimental results on several real-time speech recognition tasks demonstrate that the proposed strategy not only provides fast decoding but also yields significantly lower word error rates (WERs) than traditional iVector-based speaker adaptation frameworks.
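The offline-to-online idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 30-second enrollment threshold, and the toy vectors are all assumptions made for illustration, and no real iVector extractor is involved. The sketch only shows the two pieces the abstract names: an enrollment check that enough speech has been collected, and the reuse of one fixed, offline-estimated speaker vector on every frame at decoding time.

```python
from typing import List, Sequence

Vector = List[float]


def enrollment_ready(durations_sec: Sequence[float],
                     min_total_sec: float = 30.0) -> bool:
    """Return True once the enrolled speech is long enough.

    The 30 s default is a hypothetical threshold, not a value from the paper.
    """
    return sum(durations_sec) >= min_total_sec


def append_speaker_vector(frames: Sequence[Sequence[float]],
                          speaker_vector: Sequence[float]) -> List[Vector]:
    """Concatenate the same fixed speaker vector onto every acoustic frame.

    The vector is assumed to have been estimated offline during enrollment,
    so no per-utterance estimation is needed while decoding.
    """
    return [list(frame) + list(speaker_vector) for frame in frames]


# Toy data: two 2-dimensional frames and a 3-dimensional speaker vector.
frames = [[0.1, 0.2], [0.3, 0.4]]
speaker_vector = [1.0, 2.0, 3.0]

if enrollment_ready([10.0, 25.0]):
    network_input = append_speaker_vector(frames, speaker_vector)
```

Because the speaker vector is fixed before decoding starts, the only per-frame cost in this sketch is a concatenation, which is consistent with the abstract's claim of fast decoding.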

Author information

Authors and Affiliations

Department of Electronic and Information Engineering, Shanghai Normal University, Shanghai, 200234, China
Yanhua Long & Bo Zhang
Beijing Unisound Information Technology Co., Ltd., Beijing, 100191, China
Yijie Li

Corresponding author

Correspondence to Yanhua Long.

About this article

Cite this article

Long, Y., Li, Y. & Zhang, B. Offline to online speaker adaptation for real-time deep neural network based LVCSR systems. Multimed Tools Appl 77, 28101–28119 (2018). https://doi.org/10.1007/s11042-018-6041-2

Received: 17 August 2017
Revised: 06 March 2018
Accepted: 20 April 2018
Published: 27 April 2018
Issue Date: November 2018
DOI: https://doi.org/10.1007/s11042-018-6041-2
