Abstract
In text-prompted speaker recognition the text content is known in advance, so the semantic information and the speaker characteristics in the speech signal can be used for speech recognition and speaker verification respectively, which mitigates the problem of forged recordings. In practical applications, combining speech recognition and speaker recognition technologies yields a double-verification effect and can effectively improve security. Studies combining speaker recognition with speech recognition in Tibetan are few, mainly use non-end-to-end methods, and the resulting model performance is not ideal. Building on that earlier research, this paper applies the mainstream end-to-end approach to the speaker verification part. The network models are Fast ResNet-34 and Fast ResNet-50, which we fine-tune. "Open-set" speaker verification is essentially metric learning: an ideal embedding compresses frame-level features into a compact utterance-level representation that maximizes the inter-class distance and minimizes the intra-class distance. For the loss function, we extensively evaluate model performance with three classification objectives and three metric-learning objectives. To further improve performance, we fuse the Softmax and Angular Prototypical losses. The experimental results show that Fast ResNet-50 outperforms Fast ResNet-34, that the Angular Prototypical loss outperforms the other single loss functions, and that the model trained with the fused loss performs best, achieving an equal error rate of 4.25%.
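The Angular Prototypical objective named in the abstract can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: embeddings are toy 2-D vectors, and the scale `w = 10.0` and bias `b = -5.0` are placeholder values standing in for the loss's learnable parameters.

```python
import math


def cos(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def angular_prototypical_loss(queries, prototypes, w=10.0, b=-5.0):
    """Angular prototypical loss over one minibatch.

    queries[j] is a query embedding whose matching speaker prototype
    (mean of that speaker's support embeddings) is prototypes[j];
    off-diagonal pairs act as impostor trials.  w and b are the loss's
    learnable scale and bias, fixed here for illustration.
    """
    total = 0.0
    for j, q in enumerate(queries):
        # Scaled, shifted cosine similarities against every prototype.
        logits = [w * cos(q, p) + b for p in prototypes]
        # Cross-entropy against the matching prototype (index j).
        log_softmax = logits[j] - math.log(sum(math.exp(z) for z in logits))
        total -= log_softmax
    return total / len(queries)
```

The fused objective the paper evaluates adds this metric-learning term to an ordinary Softmax cross-entropy loss over speaker identities, so that each minibatch is penalized both for misclassifying speakers and for placing queries closer to the wrong prototype.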
Ethics declarations
Conflict of Interests
The authors have no conflict of interest in any material discussed in this article.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Gan, Zy., Yu, Y. & Luo, M. A Tibetan-dependent speaker recognition method based on deep learning. Multimed Tools Appl 81, 30821–30840 (2022). https://doi.org/10.1007/s11042-022-12540-9