Score Normalization of X-Vector Speaker Verification System for Short-Duration Speaker Verification Challenge

Rakhmanenko, Ivan; Kostyuchenko, Evgeny; Choynzonov, Evgeny; Balatskaya, Lidiya; Shelupanov, Alexander

doi:10.1007/978-3-030-60276-5_44

Conference paper
First Online: 29 September 2020

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12335)

Abstract

In this paper we present our contribution to Task 2 of the Short-duration Speaker Verification (SdSV) Challenge. The main objective of this challenge is to find new technologies for text-dependent and text-independent speaker verification in short-duration scenarios. We present some of the approaches the authors used while participating in the challenge. The described speaker verification systems include a baseline x-vector system with a PLDA backend and score normalization, an x-vector system with a neural PLDA backend, and a fusion of both systems.
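
Score normalization rescales a raw backend score using statistics of impostor (cohort) scores. The sketch below illustrates ZT-normalization, the method highlighted in the results: Z-norm standardizes a trial score with the mean and standard deviation of the enrollment model scored against a Z-cohort of impostor utterances, and T-norm then standardizes it with the statistics of the test utterance scored against a T-cohort of impostor models, whose scores are first Z-normalized themselves. It is a minimal sketch, not the authors' implementation: it assumes all cohort scores were precomputed by the backend (for example PLDA), and the function names, the use of the population standard deviation, and the `eps` guard are illustrative assumptions.

```python
import statistics
from typing import Sequence, Tuple


def norm_stats(cohort_scores: Sequence[float], eps: float = 1e-8) -> Tuple[float, float]:
    """Mean and population standard deviation of a cohort score list.

    eps (an illustrative assumption) avoids division by zero for a degenerate cohort.
    """
    mu = statistics.fmean(cohort_scores)
    sd = statistics.pstdev(cohort_scores)
    return mu, max(sd, eps)


def z_norm(raw: float, enroll_vs_zcohort: Sequence[float]) -> float:
    """Z-norm: standardize with the enrollment model scored against impostor utterances."""
    mu, sd = norm_stats(enroll_vs_zcohort)
    return (raw - mu) / sd


def t_norm(score: float, test_vs_tcohort: Sequence[float]) -> float:
    """T-norm: standardize with the test utterance scored against impostor models."""
    mu, sd = norm_stats(test_vs_tcohort)
    return (score - mu) / sd


def zt_norm(raw: float,
            enroll_vs_zcohort: Sequence[float],
            test_vs_tcohort: Sequence[float],
            tcohort_vs_zcohort: Sequence[Sequence[float]]) -> float:
    """ZT-norm: Z-norm the trial score, then T-norm it against Z-normalized cohort scores.

    tcohort_vs_zcohort[i] holds the scores of T-cohort model i against the Z-cohort,
    so each cohort score test_vs_tcohort[i] is Z-normalized with its own model's stats.
    """
    if len(test_vs_tcohort) != len(tcohort_vs_zcohort):
        raise ValueError("one Z-cohort score list is needed per T-cohort model")
    s_z = z_norm(raw, enroll_vs_zcohort)
    t_scores_z = [z_norm(s, model_vs_z)
                  for s, model_vs_z in zip(test_vs_tcohort, tcohort_vs_zcohort)]
    return t_norm(s_z, t_scores_z)
```

For example, `zt_norm(3.0, [0.0, 2.0], [1.0, 3.0], [[0.0, 2.0], [0.0, 2.0]])` returns 1.0: Z-norm maps the raw score 3.0 to 2.0, the Z-normalized T-cohort scores are 0.0 and 2.0, and T-norm maps 2.0 to 1.0.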

The main goal of this paper is to analyze the influence of different score normalization methods on the performance of x-vector-based speaker verification systems. We found that the system with a PLDA backend and ZT-normalization (the single system) gives superior performance on Farsi trials but a smaller performance improvement on English trials. Overall, in terms of minDCF, the single system performs 46.3% better than the baseline x-vector system. We also found that enrollment data augmentation brings no benefit to the neural PLDA backend, as the system's performance does not improve after augmented enrollment data are added. The single system with ZT-score normalization and additional enrollment audio augmentation performs 14.8% better than the neural PLDA backend system.
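
The comparisons above are stated in terms of minDCF: the detection cost minimized over decision thresholds and normalized by the cost of a trivial system that always accepts or always rejects. The following sketch computes it by brute force and also shows the relative-improvement arithmetic, assuming that "X% better" means a relative reduction in minDCF. The default cost parameters (P_target = 0.01, C_miss = 10, C_fa = 1) are our assumption of the SdSV 2020 evaluation plan's settings and should be checked against the plan; the function names are hypothetical.

```python
import math
from typing import Sequence


def min_dcf(target_scores: Sequence[float],
            nontarget_scores: Sequence[float],
            p_target: float = 0.01,  # assumed SdSV setting; verify against the plan
            c_miss: float = 10.0,    # assumed SdSV setting; verify against the plan
            c_fa: float = 1.0) -> float:
    """Normalized minimum detection cost over all decision thresholds.

    A trial is accepted when score >= threshold. Candidate thresholds are every
    observed score (the smallest one accepts everything) plus +inf (rejects
    everything). The sweep is O(N^2) and written for clarity, not speed.
    """
    if not target_scores or not nontarget_scores:
        raise ValueError("need at least one target and one non-target score")
    n_tar, n_non = len(target_scores), len(nontarget_scores)
    # Cost of the better trivial system (always accept or always reject).
    c_default = min(c_miss * p_target, c_fa * (1.0 - p_target))
    thresholds = sorted(set(target_scores) | set(nontarget_scores)) + [math.inf]
    best = math.inf
    for t in thresholds:
        p_miss = sum(s < t for s in target_scores) / n_tar
        p_fa = sum(s >= t for s in nontarget_scores) / n_non
        dcf = c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa
        best = min(best, dcf / c_default)
    return best


def relative_improvement(baseline: float, system: float) -> float:
    """Relative reduction of a cost metric such as minDCF (0.25 means 25% lower)."""
    return (baseline - system) / baseline
```

Under that assumption, `relative_improvement(baseline_min_dcf, system_min_dcf)` returning 0.463 would correspond to the 46.3% figure reported above; the sketch does not reproduce the paper's scores.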

Acknowledgements

The study was supported by a grant from the Russian Science Foundation (project 16-15-00038).

Author information

Authors and Affiliations

Tomsk State University of Control Systems and Radioelectronics, Lenina str. 40, 634050, Tomsk, Russia
Ivan Rakhmanenko, Evgeny Kostyuchenko, Evgeny Choynzonov, Lidiya Balatskaya & Alexander Shelupanov
Tomsk Cancer Research Institute, Kooperativniy av. 5, 634050, Tomsk, Russia
Lidiya Balatskaya

Corresponding author

Correspondence to Ivan Rakhmanenko.

Editor information

Editors and Affiliations

St. Petersburg Institute for Informatics and Automation, Russian Academy of Sciences, St. Petersburg, Russia
Alexey Karpov
Institute for Applied and Mathematical Linguistics, Moscow State Linguistic University, Moscow, Russia
Rodmonga Potapova

About this paper

Cite this paper

Rakhmanenko, I., Kostyuchenko, E., Choynzonov, E., Balatskaya, L., Shelupanov, A. (2020). Score Normalization of X-Vector Speaker Verification System for Short-Duration Speaker Verification Challenge. In: Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2020. Lecture Notes in Computer Science, vol 12335. Springer, Cham. https://doi.org/10.1007/978-3-030-60276-5_44

DOI: https://doi.org/10.1007/978-3-030-60276-5_44
Published: 29 September 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60275-8
Online ISBN: 978-3-030-60276-5
eBook Packages: Computer Science, Computer Science (R0)