Abstract
In text-prompted speaker recognition the text content is known in advance, so the semantic information and the speaker characteristics in the speech signal can be used for speech recognition and speaker verification respectively, which mitigates the problem of forged recordings. In practical applications, combining speech recognition and speaker recognition technologies yields a double-verification effect and can effectively improve security. Studies combining speaker recognition with speech recognition in Tibetan are few, mainly use non-end-to-end methods, and the resulting model performance is not ideal. Building on that earlier research, this paper applies the mainstream end-to-end approach to the speaker verification part. The network models are Fast ResNet-34 and Fast ResNet-50, which we fine-tune. "Open-set" speaker verification is essentially metric learning: an ideal embedding compresses frame-level features into a compact utterance-level representation that maximizes the inter-class distance and minimizes the intra-class distance. For the loss function, we extensively evaluate model performance with three classification objectives and three metric-learning objectives. To further improve performance, we fuse the Softmax and Angular Prototypical losses. The experimental results show that Fast ResNet-50 outperforms Fast ResNet-34, that the Angular Prototypical loss outperforms the other single loss functions, and that the model trained with the fused loss performs best, achieving an equal error rate of 4.25%.
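The Angular Prototypical objective named in the abstract can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: embeddings are toy 2-D vectors, and the scale `w = 10.0` and bias `b = -5.0` are placeholder values standing in for the loss's learnable parameters.

```python
import math


def cos(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def angular_prototypical_loss(queries, prototypes, w=10.0, b=-5.0):
    """Angular prototypical loss over one minibatch.

    queries[j] is a query embedding whose matching speaker prototype
    (mean of that speaker's support embeddings) is prototypes[j];
    off-diagonal pairs act as impostor trials.  w and b are the loss's
    learnable scale and bias, fixed here for illustration.
    """
    total = 0.0
    for j, q in enumerate(queries):
        # Scaled, shifted cosine similarities against every prototype.
        logits = [w * cos(q, p) + b for p in prototypes]
        # Cross-entropy against the matching prototype (index j).
        log_softmax = logits[j] - math.log(sum(math.exp(z) for z in logits))
        total -= log_softmax
    return total / len(queries)
```

The fused objective the paper evaluates adds this metric-learning term to an ordinary Softmax cross-entropy loss over speaker identities, so that each minibatch is penalized both for misclassifying speakers and for placing queries closer to the wrong prototype.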
Ethics declarations
Conflict of Interests
The authors have no conflict of interest in any material discussed in this article.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Gan, Zy., Yu, Y. & Luo, M. A Tibetan-dependent speaker recognition method based on deep learning. Multimed Tools Appl 81, 30821–30840 (2022). https://doi.org/10.1007/s11042-022-12540-9