Abstract
Evaluation trials are crucial for measuring the performance of speaker verification systems. However, designing trials that faithfully reflect system performance and accurately distinguish between different systems remains an open issue. In this paper, we focus on a particular problem: the impact of trials that are easy for the majority of systems to solve. We show that these ‘easy trials’ not only yield over-optimistic absolute performance, but also bias relative performance in system comparisons when they are asymmetrically distributed. This motivates the idea of mining ‘hard trials’, i.e., trials that current representative techniques find difficult. We report three approaches to retrieving hard trials and study the properties of the retrieved trials from the perspectives of both machines and humans. Finally, we propose a novel visualization tool that we name the Config-Performance (C-P) map. In this map, the value at each location represents the performance obtained with a particular proportion of easy and hard trials, thus offering a global view of the system under various test conditions. The identified hard trials and the code of the C-P map tool have been released at http://lilt.cslt.org/trials/demo/.
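To make the C-P map idea concrete, the following is a minimal sketch of how such a map could be computed from verification scores. It assumes the trials have already been split into easy and hard target/nontarget score arrays; the function names `eer` and `cp_map` and all parameters are illustrative and are not the released tool's API.

```python
import numpy as np

def eer(tgt_scores, non_scores):
    """Equal error rate via a simple threshold sweep over all observed scores."""
    thresholds = np.sort(np.concatenate([tgt_scores, non_scores]))
    far = np.array([(non_scores >= t).mean() for t in thresholds])  # false accept rate
    frr = np.array([(tgt_scores < t).mean() for t in thresholds])   # false reject rate
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2

def cp_map(easy_tgt, hard_tgt, easy_non, hard_non, steps=5, n=200, seed=0):
    """Grid of EER values: cell (i, j) evaluates a trial set that mixes a
    fraction i/(steps-1) of hard target trials and j/(steps-1) of hard
    nontarget trials, with the remainder drawn from the easy trials."""
    rng = np.random.default_rng(seed)
    fracs = np.linspace(0.0, 1.0, steps)
    grid = np.zeros((steps, steps))
    for i, a in enumerate(fracs):
        for j, b in enumerate(fracs):
            k_t, k_n = int(a * n), int(b * n)
            tgt = np.concatenate([rng.choice(hard_tgt, k_t),
                                  rng.choice(easy_tgt, n - k_t)])
            non = np.concatenate([rng.choice(hard_non, k_n),
                                  rng.choice(easy_non, n - k_n)])
            grid[i, j] = eer(tgt, non)
    return grid
```

The all-easy corner of the grid reports near-zero error, while the all-hard corner reports the pessimistic extreme; the interior cells show how performance degrades as the trial mix hardens, which is the global view the C-P map is intended to give.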
Data availability
All the data used in this paper are public data.
Code availability
The code has been published at https://gitlab.com/csltstu/sunine.
Funding
This work was supported by the National Natural Science Foundation of China (NSFC) under Grant No. 62171250, and by the Fundamental Research Funds for the Central Universities of China.
Author information
Authors and Affiliations
Contributions
L.L.: Methodology, Software, Writing - Original Draft; Di W.: Software; A.A.: Writing - Review & Editing; Dong W.: Conceptualization, Supervision, Funding acquisition, Writing - Review & Editing.
Corresponding author
Ethics declarations
Ethical approval
Not applicable.
Consent to participate
Not applicable.
Consent to publication
Participants were informed that their responses would be published in a form that does not reveal their identity.
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Li, L., Wang, D., Abel, A. et al. On evaluation trials in speaker verification. Appl Intell 54, 113–130 (2024). https://doi.org/10.1007/s10489-023-05071-9