
Integrating DNN–HMM Technique with Hierarchical Multi-layer Acoustic Model for Text-Dependent Speaker Verification

  • Published in: Circuits, Systems, and Signal Processing

Abstract

Subspace techniques, such as i-vector/probabilistic linear discriminant analysis and joint factor analysis, have been the most commonly used techniques in text-dependent speaker verification. These techniques, however, do not model the temporal structure of the pass-phrase, which is an important cue in the text-dependent context. The hierarchical multi-layer acoustic model (HiLAM) uses the Gaussian mixture model-hidden Markov model (GMM-HMM) technique, which accounts for the temporal information of the pass-phrase. Owing to this contextual modeling, HiLAM has been found to outperform the subspace techniques. In this paper, we propose integrating the DNN-HMM technique with HiLAM to further improve system performance. First, a speaker-text unit/class is defined that characterizes speaker idiosyncrasies, which are known to be associated with shorter, more fundamental units of speech text. To this end, HiLAM is used to propose a new class definition, and the training data are aligned with respect to this definition. The labeled data are then used to discriminatively train a deep neural network (DNN). This method of alignment enables the network to learn the actual context of the pass-phrase components, which is not the case for a DNN trained in automatic-speech-recognition fashion. In addition, the network models the speaker idiosyncrasies associated with specific, finer text units. Replacing the GMM likelihood probabilities of HiLAM with DNN posteriors leads to a significant improvement over the baseline HiLAM system: a relative equal error rate (EER) reduction of up to 36.58% is observed on Part 1 of the RSR2015 database.
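The core scoring change described above — replacing GMM emission likelihoods with DNN posteriors in an HMM — can be illustrated with the standard hybrid conversion: by Bayes' rule, the scaled likelihood p(x|s) is proportional to p(s|x)/p(s), so dividing each DNN posterior by its state prior yields quantities that can stand in for GMM likelihoods during Viterbi scoring. The sketch below is a minimal toy illustration under these standard assumptions, not the authors' implementation; the function names, HMM topology, and all numbers are illustrative.

```python
import numpy as np

def posteriors_to_log_likelihoods(posteriors, priors, eps=1e-10):
    """Convert per-frame DNN state posteriors p(s|x) into scaled
    log-likelihoods log p(x|s) + const = log p(s|x) - log p(s)."""
    return np.log(posteriors + eps) - np.log(priors + eps)

def viterbi_score(log_likes, log_trans, log_init):
    """Log-score of the best state path through an HMM (log domain)."""
    num_frames, _ = log_likes.shape
    delta = log_init + log_likes[0]
    for t in range(1, num_frames):
        # best predecessor for each state, then add the emission term
        delta = np.max(delta[:, None] + log_trans, axis=0) + log_likes[t]
    return np.max(delta)

# Toy example: 4 frames, 3 states of a left-to-right HMM.
rng = np.random.default_rng(0)
post = rng.dirichlet(np.ones(3), size=4)   # stand-in for DNN posteriors
priors = post.mean(axis=0)                 # state priors, e.g. from alignments
log_likes = posteriors_to_log_likelihoods(post, priors)

log_trans = np.log(np.array([[0.6, 0.4, 0.0],
                             [0.0, 0.6, 0.4],
                             [0.0, 0.0, 1.0]]) + 1e-10)
log_init = np.log(np.array([1.0, 1e-10, 1e-10]))

score = viterbi_score(log_likes, log_trans, log_init)
print(score)
```

In a full system, this per-utterance Viterbi score against the speaker-specific HMM would be normalized (e.g. against a background model) before thresholding for the accept/reject decision.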



Acknowledgements

The authors would like to thank the Speech and Image Processing Laboratory of the National Institute of Technology Silchar, Silchar, for supporting the research work.

Author information

Corresponding author

Correspondence to Mohammad Azharuddin Laskar.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Laskar, M.A., Laskar, R.H. Integrating DNN–HMM Technique with Hierarchical Multi-layer Acoustic Model for Text-Dependent Speaker Verification. Circuits Syst Signal Process 38, 3548–3572 (2019). https://doi.org/10.1007/s00034-019-01103-3

