Skip to main content
Log in

Text-independent speaker identification based on deep Gaussian correlation supervector

  • Published:
International Journal of Speech Technology Aims and scope Submit manuscript

Abstract

Great progress has been made in speaker recognition by extracting features from Gaussian mixture model (GMM) or deep neural network (DNN). In this paper, to extract the personality characteristics of speakers more accurately, we propose a novel deep Gaussian correlation supervector (DGCS) feature based on a DBN-GMM hybrid model. In the method, we firstly extract MFCC from preprocessed speech signals and employ a DBN to gain bottleneck features. Then bottleneck features are fed to a GMM to extract deep Gaussian supervector (DGS) which can be as the input of SVM achieving pattern discrimination and judgment. Further considering the relevance between deep mean vectors of DGS, DGS will be transformed to DGCS by the method of supervector recombination. Our experiments show that utilizing DGCS can significantly improve recognition rate by 17.979% compared to the system only with supervector, 18.22% compared to the system with DGS and 1.875% compared to the system with correlation supervector. In addition, the proposed DGCS demonstrates that time complexity for identification task can be largely reduced.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8

Similar content being viewed by others

References

  • Bahmaninezhad, F. (2017). I-vector/PLDA speaker recognition using support vectors with discriminant analysis. In ICASSP, pp. 5410–5414.

  • Bansal, P., Bharti, R. (2015). Speaker recognition using MFCC, shifted MFCC with vector quantization and fuzz. In International conference on soft computing techniques and implementations, pp. 41–45.

  • Bhattacharya, G., Kenny, P., Alam, J., & Stafylakis, T. (2016). Deep neural network based text-dependent speaker verification: preliminary results in Odyssey 2016: The Speaker and Language Recognition Workshop, Bilbao, June 21–24, pp. 9–15.

  • Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2), 121–167.

    Article  Google Scholar 

  • Ciregan, D., Meier, U., Schmidhuber, J. (2012). Multi-column deep neural networks for image classification. In IEEE conference on computer vision and pattern recognition, pp. 3642–3649.

  • Dehak, N., Kenny, P. J., & Dehak, R. (2011). Front- end factor analysis for speaker verification. IEEE Transactions on Audio, Speech and Language Processing, 19(4), 788–798.

    Article  Google Scholar 

  • Giusti, A., Ciresan, D. C., Masci, J. (2013). Fast image scanning with deep max-pooling convolutional neural networks. In IEEE international conference on image processing, pp. 4034–4038.

  • Hazrat, A., & Son, N. (2018). Speaker recognition with hybrid features from a deep belief network. Neural Comput & Applic, 29, 13–19.

    Article  Google Scholar 

  • He, T. X., Fan, Y. C., Qian, Y. M., Tan, T., & Yu, K. (2014). Reshaping deep neural network for fast decoding by node-pruning. In Acoustics, speech and signal processing (ICASSP), pp. 245–224.

  • Hinton, G. E. (2012). A practical guide to training restricted Boltzmann machines in Neural Networks: Tricks of the Trade (Lecture Notes in Computer Science, 7700). Berlin: Springer, pp. 599–619

  • Liu, Z. L., Wu, Z. D., & Li, T. (2018). GMM and CNN hybrid method for short utterance speaker recognition. IEEE Transactions on Industrial informatics, 14(7), 3244–3252.

    Article  Google Scholar 

  • Ma, B., Li, H. Z., Larcher, A. and Lee, K. A. (2014).Text-dependent speaker verification: classifiers, databases and RSR2015. In Speech Communication, pp. 56–77.

  • Makrem, B., Imen, J. (2016). Study of speaker recognition system based on feed forward deep neural networks exploring text-dependent mode. In International conference on science of electronics, technologies of information and telecommunications, pp. 355–360.

  • Matejka, P., Zhang, L., Ng, T., Mallidi, H. S., Glembek, O., Ma, J., and Zhang, B. (2014). Neural network bottleneck features for language identification. In Proceedings of IEEE Odyssey, pp. 299–304.

  • Mohamed, A., Dahl, G., Sainath, T. N. (2011). Deep belief networks using discriminative features for phone recognition. In IEEE international conference on acoustics, speech and signal processing, pp. 5060–5063.

  • Sarkar, A. K., Do, C. T., & Le, V. B. (2014). Combination of cepstral and phonetically discriminative features for speaker verification. IEEE Signal Processing Letters, 21(9), 1040–1044.

    Article  Google Scholar 

  • Sergey, N., Timur, P., Andrey, S. and Alexey, S. (2014).Text-dependent GMM-GFA system for password based speaker verification. In IEEE ICASSP, pp. 729–737.

  • Song, Y., Jiang, B., & Bao, Y. (2013). I-vector representation based on bottleneck features for language identification. Electronics Letters, 49(24), 1569–1570.

    Article  Google Scholar 

  • Tian, G. (2011). Hybrid genetic and variational expectation-maximization algorithm for Gaussian-mixture-model-based brain MR image segmentation. IEEE Transactions on Information Technology in Biomedicine, 15(3), 373–380.

    Article  Google Scholar 

  • Yamada, T., Wang, L., and Kai, A. (2013). Improvement of distant-talking speaker identification using bottleneck features of DNN. In Proc. Interspeech, pp. 3661–3664.

  • Yu, H., Tan, Z. H., & Ma, Z. Y. (2018). Spoofing detection in automatic speaker verification systems using DNN classifiers and dynamic acoustic features. IEEE Transactions on Neural Networks and Learning Systems, 29(10), 1–12.

    Article  Google Scholar 

  • Zhang, S. X., & Mak, M. W. (2011). Optimized discriminative kernel for SVM scoring and its application to speaker verification. IEEE Transactions on Neural Networks, 22(2), 173–185.

    Article  Google Scholar 

  • Zhang, S. X., Chen, Z., Zhao, Y. (2016). End-to-end attention based text-dependent speaker verification. IEEE Spoken Language Technology Workshop, pp. 171–178.

Download references

Acknowledgements

This work is supported by the National Natural Science Foundation of China (61671252, 61501251), the Natural Science Foundation of Jiangsu Province (BK20140891).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Linhui Sun.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Sun, L., Gu, T., Xie, K. et al. Text-independent speaker identification based on deep Gaussian correlation supervector. Int J Speech Technol 22, 449–457 (2019). https://doi.org/10.1007/s10772-019-09618-5

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10772-019-09618-5

Keywords

Navigation