Deep Speaker Embeddings for Speaker Verification of Children

Abed, Mohammed Hamzah; Sztahó, Dávid

doi:10.1007/978-3-031-70566-3_6

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 15049)

Included in the following conference series:

International Conference on Text, Speech, and Dialogue

Abstract

Deep speaker embedding models are currently the most advanced feature extraction methods for speaker verification. However, their effectiveness on children's voices has not been thoroughly researched: most methods proposed in recent years concentrate on adult speakers, and few researchers have focused on children. This study examines three deep learning-based speaker embedding methods and their ability to differentiate between child speakers in speaker verification. The X-vector, ECAPA-TDNN, and RESNET-TDNN methods were evaluated for forensic voice comparison, both as pre-trained models and after fine-tuning on children's speech samples. Evaluation followed the likelihood-ratio framework, with likelihood-ratio scores calculated from children's voices. The workflow was evaluated on the Samromur Children dataset, which comprises 131 hours of speech from 3175 speakers of both sexes aged 4 to 17. Without fine-tuning the embedding models, RESNET-TDNN has the lowest EER and $C_{llr}^{min}$ values (10.8% and 0.368, respectively). With fine-tuning, ECAPA-TDNN performs best (EER and $C_{llr}^{min}$ of 2.9% and 0.111, respectively). No difference was found between the sexes of the speakers. When the results were analysed by speaker age range (4–10, 11–15, and 16–17), performance varied: younger speakers were identified less accurately by the original pre-trained models, and this tendency changed slightly after fine-tuning. The results indicate that the models could be used in real-life investigation cases and that fine-tuning helps mitigate the performance degradation for young speakers.
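The two metrics reported in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code, which this preview does not show; the function names `cllr` and `eer` and the toy scores are hypothetical. The sketch computes the log-likelihood-ratio cost $C_{llr}$ from natural-log likelihood ratios and approximates the EER by sweeping a threshold over the observed scores. The $C_{llr}^{min}$ variant additionally applies an optimal monotonic calibration (commonly done with the pool-adjacent-violators algorithm) before computing the cost, which is omitted here.

```python
import math


def cllr(target_llrs, nontarget_llrs):
    """Log-likelihood-ratio cost, given natural-log LRs for each trial type."""
    # Same-speaker trials are penalised when the LR is small.
    pen_tar = sum(math.log2(1 + math.exp(-l)) for l in target_llrs) / len(target_llrs)
    # Different-speaker trials are penalised when the LR is large.
    pen_non = sum(math.log2(1 + math.exp(l)) for l in nontarget_llrs) / len(nontarget_llrs)
    return 0.5 * (pen_tar + pen_non)


def eer(target_scores, nontarget_scores):
    """Approximate equal error rate by trying every observed score as threshold."""
    best_gap, best_eer = float("inf"), None
    for t in sorted(set(target_scores) | set(nontarget_scores)):
        # Miss rate: same-speaker trials scoring below the threshold.
        fnr = sum(s < t for s in target_scores) / len(target_scores)
        # False-alarm rate: different-speaker trials at or above the threshold.
        fpr = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(fnr - fpr)
        if gap < best_gap:
            best_gap, best_eer = gap, (fnr + fpr) / 2
    return best_eer


# Toy scores (made up for illustration, not taken from the paper).
same_speaker = [0.5, 2.0, 3.0, 4.0]
diff_speaker = [-1.0, 0.0, 1.0, 2.5]
print("EER:", eer(same_speaker, diff_speaker))
print("Cllr:", cllr(same_speaker, diff_speaker))
```

A system whose likelihood ratios are always 1 (log-LR of 0) gives $C_{llr} = 1$, so values well below 1, such as those in the abstract, indicate informative likelihood ratios.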

Acknowledgement

The work was funded by project no. FK128615, which has been implemented with support from the National Research, Development and Innovation Fund of Hungary, financed under the FK_18 funding scheme, and by the Stipendium Hungaricum Programme.

Author information

Authors and Affiliations

Department of Telecommunications and Artificial Intelligence, Budapest University of Technology and Economics, Magyar tudósok körútja 2, Budapest, 1117, Hungary
Mohammed Hamzah Abed & Dávid Sztahó

Corresponding author

Correspondence to Mohammed Hamzah Abed.

Editor information

Editors and Affiliations

Friedrich-Alexander-Universität, Erlangen, Germany
Elmar Nöth
Masaryk University, Brno, Czech Republic
Aleš Horák
Masaryk University, Brno, Czech Republic
Petr Sojka

About this paper

Cite this paper

Abed, M.H., Sztahó, D. (2024). Deep Speaker Embeddings for Speaker Verification of Children. In: Nöth, E., Horák, A., Sojka, P. (eds) Text, Speech, and Dialogue. TSD 2024. Lecture Notes in Computer Science, vol 15049. Springer, Cham. https://doi.org/10.1007/978-3-031-70566-3_6

DOI: https://doi.org/10.1007/978-3-031-70566-3_6
Published: 27 August 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-70565-6
Online ISBN: 978-3-031-70566-3
eBook Packages: Computer Science (R0)
