Deep Speaker Embeddings for Speaker Verification of Children

  • Conference paper
Text, Speech, and Dialogue (TSD 2024)

Abstract

Deep speaker embedding models are currently the most advanced feature extraction methods for speaker verification. However, their effectiveness in identifying children’s voices has not been thoroughly researched. While various methods have been proposed in recent years, most concentrate on adult speakers, with fewer researchers focusing on children. This study examines three deep learning-based speaker embedding methods and their ability to differentiate between child speakers in speaker verification. The X-vector, ECAPA-TDNN, and RESNET-TDNN methods were evaluated for forensic voice comparison, using pre-trained models and fine-tuning them on children’s speech samples. Evaluation followed the likelihood-ratio framework, with likelihood-ratio scores calculated from children’s voices. The Samromur Children dataset was used to evaluate the workflow. It comprises 131 h of speech from 3175 speakers of both sexes aged between 4 and 17. The results indicate that, without fine-tuning the embedding models, RESNET-TDNN has the lowest EER and \(C_{llr}^{min}\) values (10.8% and 0.368, respectively). With fine-tuning, ECAPA-TDNN performs best (EER and \(C_{llr}^{min}\) of 2.9% and 0.111, respectively). No difference was found between the sexes of the speakers. When the results were analysed by speaker age range (4–10, 11–15, and 16–17), performance varied. Younger speakers were identified less accurately with the original pre-trained models; after fine-tuning, this tendency changed slightly. The results indicate that the models could be used in real-life investigation cases and that fine-tuning helps mitigate the performance degradation in young speakers.
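The two metrics reported in the abstract can be illustrated with a minimal sketch. The function names and the toy scores below are hypothetical and are not taken from the paper. The sketch assumes that same-speaker and different-speaker comparison scores are already available, and that the inputs to `cllr` are natural-log likelihood ratios. It computes the plain log-likelihood-ratio cost; the minimum value reported in the abstract additionally requires an optimal calibration step (commonly done with the pool-adjacent-violators algorithm), which is not shown here.

```python
import math


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by using every observed score as a threshold."""
    best_gap, eer = float("inf"), None
    for thr in sorted(set(target_scores) | set(nontarget_scores)):
        # Miss rate: same-speaker trials scored below the threshold.
        fnr = sum(s < thr for s in target_scores) / len(target_scores)
        # False-alarm rate: different-speaker trials at or above the threshold.
        fpr = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(fnr - fpr)
        if gap < best_gap:
            best_gap, eer = gap, (fnr + fpr) / 2
    return eer


def cllr(target_llrs, nontarget_llrs):
    """Log-likelihood-ratio cost for natural-log LRs (no calibration step)."""
    tar = sum(math.log2(1 + math.exp(-l)) for l in target_llrs) / len(target_llrs)
    non = sum(math.log2(1 + math.exp(l)) for l in nontarget_llrs) / len(nontarget_llrs)
    return 0.5 * (tar + non)


# Toy scores, purely illustrative.
same_speaker = [2.0, 1.5, 0.5, -0.2]
different_speaker = [-2.0, -1.0, 0.3, -0.5]
print(equal_error_rate(same_speaker, different_speaker))
print(cllr(same_speaker, different_speaker))
```

A system whose likelihood ratios are all 1 (log LR of 0) carries no information and yields a cost of exactly 1, while well-separated, well-calibrated scores push the cost towards 0.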

Notes

  1. https://huggingface.co/speechbrain/spkrec-xvect-voxceleb

  2. https://huggingface.co/speechbrain/spkrec-ecapa-voxceleb

  3. https://huggingface.co/speechbrain/spkrec-resnet-voxceleb

  4. https://huggingface.co/datasets/language-and-voice-lab/samromur_children

Acknowledgement

This work was funded by project no. FK128615, implemented with support from the National Research, Development and Innovation Fund of Hungary under the FK_18 funding scheme, and by the Stipendium Hungaricum Programme.

Author information

Corresponding author

Correspondence to Mohammed Hamzah Abed.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Abed, M.H., Sztahó, D. (2024). Deep Speaker Embeddings for Speaker Verification of Children. In: Nöth, E., Horák, A., Sojka, P. (eds) Text, Speech, and Dialogue. TSD 2024. Lecture Notes in Computer Science, vol 15049. Springer, Cham. https://doi.org/10.1007/978-3-031-70566-3_6

  • DOI: https://doi.org/10.1007/978-3-031-70566-3_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-70565-6

  • Online ISBN: 978-3-031-70566-3

  • eBook Packages: Computer Science, Computer Science (R0)
