
On evaluation trials in speaker verification

Published in Applied Intelligence.

Abstract

Evaluation trials are crucial for measuring the performance of speaker verification systems. However, designing trials that faithfully reflect system performance and accurately distinguish between systems remains an open issue. In this paper, we focus on a particular problem: the impact of trials that are easy for the majority of systems to solve. We show that these ‘easy trials’ not only yield over-optimistic absolute performance, but also bias relative performance in system comparisons when they are asymmetrically distributed. This motivates the idea of mining ‘hard trials’, i.e., trials that current representative techniques regard as difficult. We report three approaches to retrieving hard trials, and study the properties of the retrieved trials from the perspectives of both machines and humans. Finally, we propose a novel visualization tool which we name a Config-Performance (C-P) map. The value at each location in this map represents the performance under a particular proportion of easy and hard trials, thus offering a global view of the system under various test conditions. The identified hard trials and the code of the C-P map tool have been released at http://lilt.cslt.org/trials/demo/.
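To make the C-P map idea concrete, the following is a minimal, hypothetical sketch (not the released tool's API; `eer` and `cp_curve` are illustrative names) of one slice of such a map: the equal error rate (EER) of a system as the proportion of hard trials mixed into a fixed-size evaluation set grows from 0 to 1.

```python
import numpy as np

def eer(scores, labels):
    # Equal error rate: sweep each score as a decision threshold and keep
    # the smallest max(false-rejection rate, false-acceptance rate).
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    tar, non = scores[labels == 1], scores[labels == 0]
    best = 1.0
    for t in scores:
        fr = np.mean(tar < t)   # target trials wrongly rejected
        fa = np.mean(non >= t)  # nontarget trials wrongly accepted
        best = min(best, max(fr, fa))
    return float(best)

def cp_curve(easy, hard, steps=5):
    # One slice of a C-P map: EER as the fraction of hard trials in a
    # fixed-size evaluation set rises from 0 to 1. `easy` and `hard`
    # are (scores, labels) pairs for the two trial pools.
    es, el = np.asarray(easy[0], float), np.asarray(easy[1], int)
    hs, hl = np.asarray(hard[0], float), np.asarray(hard[1], int)
    n = min(len(es), len(hs))
    curve = []
    for p in np.linspace(0.0, 1.0, steps):
        k = int(round(p * n))
        s = np.concatenate([es[:n - k], hs[:k]])
        lab = np.concatenate([el[:n - k], hl[:k]])
        curve.append(eer(s, lab))
    return curve
```

A full C-P map would stack such curves over a second configuration axis (e.g., the proportion of target versus nontarget trials); a curve that rises steeply as hard trials are added indicates a system whose headline EER is flattered by easy trials.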


Data availability

All the data used in this paper are public data.

Code availability

The code has been published at https://gitlab.com/csltstu/sunine.



Funding

This work was supported by the National Natural Science Foundation of China (NSFC) under Grant No. 62171250, and by the Fundamental Research Funds for the Central Universities of China.

Author information


Contributions

L.L.: Methodology, Software, Writing - Original Draft; Di W.: Software; A.A.: Writing - Review & Editing; Dong W.: Conceptualization, Supervision, Funding acquisition, Writing - Review & Editing.

Corresponding author

Correspondence to Dong Wang.

Ethics declarations

Ethical approval

Not applicable.

Consent to participate

Not applicable.

Consent to publication

Participants were informed that their responses would be published in a form that does not reveal their identity.

Conflict of interest

The authors of this paper declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Li, L., Wang, D., Abel, A. et al. On evaluation trials in speaker verification. Appl Intell 54, 113–130 (2024). https://doi.org/10.1007/s10489-023-05071-9

