
A Speaker Verification Method Based on TDNN–LSTMP

Published in: Circuits, Systems, and Signal Processing

Abstract

In speaker recognition, a robust modeling method is essential. This paper proposes a speaker verification method based on a fusion of the time-delay neural network (TDNN) and long short-term memory with a recurrent projection layer (LSTMP), addressing the speaker modeling problem in speaker verification. We apply the TDNN–LSTMP fusion to an i-vector speaker recognition system built on the Gaussian mixture model–universal background model framework. Because the fused model can capture long-term dependencies, the resulting universal background model encodes a larger amount of speaker information, allowing more speaker-dependent feature parameters to be extracted from the speech signal. We conducted experiments on four corpora: two in Chinese and two in English, using the equal error rate (EER), minimum detection cost function (minDCF) and detection error tradeoff (DET) curve as criteria for system performance evaluation. The experimental results show that the TDNN–LSTMP/i-vector speaker recognition method outperforms the baseline system on both the Chinese and English corpora and exhibits better robustness.
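The evaluation criteria named in the abstract (EER and minDCF) can both be computed by sweeping a decision threshold over the genuine and impostor trial scores. The sketch below is a hypothetical helper, not code from the paper; the cost parameters (`p_target`, `c_miss`, `c_fa`) follow the NIST SRE-style detection cost model and their values here are illustrative assumptions only.

```python
import numpy as np

def eer_and_min_dcf(genuine, impostor, p_target=0.01, c_miss=10.0, c_fa=1.0):
    """Compute EER and minimum detection cost from verification scores.

    genuine, impostor: 1-D arrays of scores for target and non-target trials
    (higher score = more likely the same speaker).
    """
    scores = np.concatenate([genuine, impostor])
    labels = np.concatenate([np.ones(len(genuine)), np.zeros(len(impostor))])
    # Sort trials by score; position i corresponds to a threshold that
    # rejects every trial scoring at or below scores[i].
    labels = labels[np.argsort(scores)]
    n_gen, n_imp = len(genuine), len(impostor)
    p_miss = np.cumsum(labels) / n_gen            # genuine trials rejected
    p_fa = 1.0 - np.cumsum(1 - labels) / n_imp    # impostor trials accepted
    # EER: the operating point where miss and false-alarm rates cross.
    idx = np.argmin(np.abs(p_miss - p_fa))
    eer = (p_miss[idx] + p_fa[idx]) / 2
    # minDCF: minimum of the weighted detection cost over all thresholds.
    dcf = c_miss * p_miss * p_target + c_fa * p_fa * (1 - p_target)
    return eer, dcf.min()
```

A perfectly separating system (all genuine scores above all impostor scores) yields an EER and minDCF of zero; the DET curve is obtained by plotting the same `p_miss` against `p_fa` on normal-deviate axes.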




Acknowledgements

The research reported here was supported by the National Natural Science Foundation of China (No. 31772064).

Author information


Corresponding author

Correspondence to Longlian Zhao.



About this article


Cite this article

Liu, H., Zhao, L. A Speaker Verification Method Based on TDNN–LSTMP. Circuits Syst Signal Process 38, 4840–4854 (2019). https://doi.org/10.1007/s00034-019-01092-3

