Abstract
Subspace techniques, such as i-vector/probabilistic linear discriminant analysis (PLDA) and joint factor analysis, have been the most commonly used approaches in text-dependent speaker verification. These techniques, however, do not model the temporal structure of the pass-phrase, which is an important cue in the text-dependent setting. The hierarchical multi-layer acoustic model (HiLAM) uses the Gaussian mixture model (GMM)-hidden Markov model (HMM) framework, which also captures the temporal information of the pass-phrase. Owing to this contextual modeling, HiLAM has been found to outperform the subspace techniques. In this paper, we propose integrating the DNN-HMM technique with HiLAM to further improve system performance. First, we define a speaker-text unit/class that characterizes speaker idiosyncrasies, which are known to be associated with shorter, more fundamental units of speech. To this end, HiLAM is used to derive the new class definition, and the training data are aligned with respect to it. The labeled data are then used to discriminatively train a deep neural network (DNN). This method of alignment enables the network to learn the actual context of the pass-phrase components, which is not the case for a DNN trained in the automatic speech recognition fashion; in addition, the network models the speaker idiosyncrasies associated with specific, finer text units. Replacing the GMM likelihood probabilities of HiLAM with DNN posteriors yields a significant improvement over the baseline HiLAM system, with a relative EER reduction of up to 36.58% on Part 1 of the RSR2015 database.
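The core mechanism the abstract describes — substituting DNN posteriors for GMM emission likelihoods in an HMM — follows the standard hybrid DNN-HMM trick: a scaled likelihood is obtained by dividing the network's state posterior by the state prior. The sketch below illustrates this with hypothetical toy numbers (the state count, posterior values, and priors are illustrative assumptions, not values from the paper), and a simplified per-frame best-state score stands in for full Viterbi decoding.

```python
import numpy as np

# Hypothetical toy setup: 3 speaker-text HMM states, 4 frames.
# dnn_posteriors[t, s] = P(state s | frame t), from a softmax output layer.
dnn_posteriors = np.array([
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.2, 0.7],
])

# State priors, as would be estimated from the frequency of each state
# label in the force-aligned training data.
state_priors = np.array([0.4, 0.35, 0.25])

# Hybrid DNN-HMM substitution: scaled likelihood P(x|s) proportional to
# P(s|x) / P(s), used in place of the GMM emission likelihood.
log_scaled_likelihoods = np.log(dnn_posteriors) - np.log(state_priors)

# Simplified frame-level score: average log scaled likelihood of the
# best-matching state per frame (a stand-in for Viterbi decoding).
score = log_scaled_likelihoods.max(axis=1).mean()
print(round(score, 3))
```

In a full system these scaled likelihoods would feed the HMM's Viterbi pass, and the resulting utterance score would be compared against a threshold for the verification decision.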
Acknowledgements
The authors would like to thank the Speech and Image Processing Laboratory of the National Institute of Technology Silchar, Silchar, for supporting the research work.
Cite this article
Laskar, M.A., Laskar, R.H. Integrating DNN–HMM Technique with Hierarchical Multi-layer Acoustic Model for Text-Dependent Speaker Verification. Circuits Syst Signal Process 38, 3548–3572 (2019). https://doi.org/10.1007/s00034-019-01103-3