
A Tibetan-dependent speaker recognition method based on deep learning

Published in: Multimedia Tools and Applications

Abstract

In text-prompted speaker recognition, where the text content is known, the semantic information and the speaker characteristics in a speech signal can be used for speech recognition and speaker verification respectively, which mitigates the forged-recording problem of text-dependent systems. In practical applications, combining speech recognition and speaker recognition technologies yields a double-verification effect and can effectively improve security. Few studies combine speaker recognition and speech recognition for Tibetan; those that do mainly use non-end-to-end methods, and their performance is not ideal. Building on that earlier work, this paper applies the mainstream end-to-end approach to the speaker verification part. The network models are Fast ResNet-34 and Fast ResNet-50, both fine-tuned for this task. "Open-set" speaker verification is essentially metric learning: an ideal embedding compresses frame-level features into a compact utterance-level representation, thereby maximizing inter-class distance and minimizing intra-class distance. For the training objective, we extensively evaluate the models with three classification loss functions and three metric-learning loss functions. To further improve performance, we fuse the Softmax and Angular Prototypical loss functions. The experimental results show that Fast ResNet-50 outperforms Fast ResNet-34, the Angular Prototypical loss outperforms the other single loss functions, and the model trained with the fused loss performs best, with an equal error rate of 4.25%.
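The two ingredients named above (the fused Softmax + Angular Prototypical objective, and the equal error rate used to report results) can be sketched concretely. Below is a minimal NumPy illustration, not the paper's implementation: the function names, the `(speakers, utterances, dim)` batch layout, the scale/offset values `w` and `b`, and the fusion weight `alpha` are all illustrative assumptions.

```python
import numpy as np

def softmax_ce(logits, labels):
    """Mean cross-entropy over rows of `logits` against integer `labels`."""
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

def angular_prototypical_loss(emb, w=10.0, b=-5.0):
    """Angular prototypical loss for a batch of shape
    (n_speakers, n_utts, dim): the last utterance of each speaker is
    the query; the centroid of the remaining utterances is that
    speaker's prototype. Scaled cosine similarities are trained with
    cross-entropy so each query matches its own speaker's prototype.
    (In training, w and b would be learnable, with w kept positive.)"""
    query = emb[:, -1, :]                    # (S, D) query embeddings
    proto = emb[:, :-1, :].mean(axis=1)      # (S, D) centroid prototypes
    qn = query / np.linalg.norm(query, axis=1, keepdims=True)
    pn = proto / np.linalg.norm(proto, axis=1, keepdims=True)
    logits = w * (qn @ pn.T) + b             # (S, S) scaled cosines
    return softmax_ce(logits, np.arange(emb.shape[0]))

def fused_loss(emb, class_logits, class_labels, alpha=1.0):
    """Fusion of a softmax classification loss over speaker classes
    with the angular prototypical metric loss, as a weighted sum."""
    return softmax_ce(class_logits, class_labels) + \
           alpha * angular_prototypical_loss(emb)

def eer(scores, labels):
    """Equal error rate of verification trial scores: sweep the
    decision threshold to the point where the false-accept rate
    (impostor accepted) equals the false-reject rate (target
    rejected), and report that common rate."""
    scores = np.asarray(scores, float)
    labels = np.asarray(labels, bool)
    best = (2.0, 1.0)                        # (|FAR - FRR|, candidate EER)
    for t in np.unique(scores):
        far = float(np.mean(scores[~labels] >= t))
        frr = float(np.mean(scores[labels] < t))
        best = min(best, (abs(far - frr), (far + frr) / 2))
    return best[1]
```

In this sketch, each speaker contributes one query scored against the prototypes of all speakers in the batch, so the metric term directly enlarges inter-class and shrinks intra-class cosine distances; `eer` is the threshold-balanced error that the 4.25% figure above refers to.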



Author information

Correspondence to Zhen-ye Gan.

Ethics declarations

Conflict of interest

The authors have no conflict of interest in any material discussed in this article.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Cite this article

Gan, Zy., Yu, Y. & Luo, M. A Tibetan-dependent speaker recognition method based on deep learning. Multimed Tools Appl 81, 30821–30840 (2022). https://doi.org/10.1007/s11042-022-12540-9

