Abstract
Subspace techniques, such as i-vector/probabilistic linear discriminant analysis (PLDA) and joint factor analysis, have been the most commonly used approaches in text-dependent speaker verification. These techniques, however, do not model the temporal structure of the pass-phrase, which is an important cue in the text-dependent setting. The hierarchical multi-layer acoustic model (HiLAM) uses the Gaussian mixture model (GMM)-hidden Markov model (HMM) framework, which also captures the temporal information of the pass-phrase. Owing to this contextual modeling, HiLAM has been found to outperform the subspace techniques. In this paper, we propose integrating the DNN-HMM technique with HiLAM to further improve system performance. First, we define a speaker-text unit/class that characterizes speaker idiosyncrasies, which are known to be associated with shorter, more fundamental units of speech. To this end, HiLAM is used to derive the new class definition, and the training data are aligned with respect to it. The labeled data are then used to discriminatively train a deep neural network (DNN). This method of alignment enables the network to learn the actual context of the pass-phrase components, which is not the case for a DNN trained in the automatic speech recognition fashion; in addition, the network models the speaker idiosyncrasies associated with specific, finer text units. Replacing the GMM likelihood probabilities of HiLAM with DNN posteriors yields a significant improvement over the baseline HiLAM system, with a relative EER reduction of up to 36.58% on Part 1 of the RSR2015 database.
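The core mechanism the abstract describes — substituting DNN posteriors for GMM emission likelihoods in an HMM — follows the standard hybrid DNN-HMM trick: a scaled likelihood is obtained by dividing the network's state posterior by the state prior. The sketch below illustrates this with hypothetical toy numbers (the state count, posterior values, and priors are illustrative assumptions, not values from the paper), and a simplified per-frame best-state score stands in for full Viterbi decoding.

```python
import numpy as np

# Hypothetical toy setup: 3 speaker-text HMM states, 4 frames.
# dnn_posteriors[t, s] = P(state s | frame t), from a softmax output layer.
dnn_posteriors = np.array([
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.2, 0.7],
])

# State priors, as would be estimated from the frequency of each state
# label in the force-aligned training data.
state_priors = np.array([0.4, 0.35, 0.25])

# Hybrid DNN-HMM substitution: scaled likelihood P(x|s) proportional to
# P(s|x) / P(s), used in place of the GMM emission likelihood.
log_scaled_likelihoods = np.log(dnn_posteriors) - np.log(state_priors)

# Simplified frame-level score: average log scaled likelihood of the
# best-matching state per frame (a stand-in for Viterbi decoding).
score = log_scaled_likelihoods.max(axis=1).mean()
print(round(score, 3))
```

In a full system these scaled likelihoods would feed the HMM's Viterbi pass, and the resulting utterance score would be compared against a threshold for the verification decision.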
Acknowledgements
The authors would like to thank the Speech and Image Processing Laboratory of the National Institute of Technology Silchar, Silchar, for supporting the research work.
Cite this article
Laskar, M.A., Laskar, R.H. Integrating DNN–HMM Technique with Hierarchical Multi-layer Acoustic Model for Text-Dependent Speaker Verification. Circuits Syst Signal Process 38, 3548–3572 (2019). https://doi.org/10.1007/s00034-019-01103-3