Abstract
Deep speaker embedding models are currently the most advanced feature-extraction methods for speaker verification, but their effectiveness on children's voices has not been thoroughly investigated. Although various methods have been proposed in recent years, most concentrate on adult speakers, and far fewer studies address children. This study examines three deep learning-based speaker embedding methods and their ability to discriminate between child speakers in speaker verification. The X-vector, ECAPA-TDNN, and RESNET-TDNN methods were evaluated for forensic voice comparison, both with pre-trained models and after fine-tuning them on children's speech samples. Evaluation was performed within the likelihood-ratio framework, with likelihood-ratio scores calculated on the basis of children's voices. The workflow was evaluated on the Samromur Children dataset, which comprises 131 h of speech from 3175 speakers of both sexes aged 4 to 17. The results indicate that, without fine-tuning the embedding models, RESNET-TDNN achieves the lowest EER and \(C_{llr}^{min}\) values (10.8% and 0.368, respectively). With fine-tuning, ECAPA-TDNN performs best (EER and \(C_{llr}^{min}\) of 2.9% and 0.111, respectively). No difference was found between the sexes of the speakers. When the results were analysed by age range (4–10, 11–15, and 16–17), varying levels of performance were observed: younger speakers were identified less accurately with the original pre-trained models, although fine-tuning changed this tendency slightly. The results indicate that the models could be used in real-life investigative cases and that fine-tuning helps mitigate the performance degradation for young speakers.
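As an illustration of the embedding-and-scoring step described above, the following minimal Python sketch uses the SpeechBrain toolkit with a publicly available pre-trained ECAPA-TDNN speaker encoder. The model identifier and file names are illustrative assumptions, and the sketch omits the fine-tuning and likelihood-ratio calibration steps used in the study.

# Minimal sketch (assumed setup): extract ECAPA-TDNN embeddings for two
# utterances and compare them with a cosine score. This is not the authors'
# exact pipeline; calibration into likelihood ratios would follow separately.
import torchaudio
import torch.nn.functional as F
from speechbrain.pretrained import EncoderClassifier

encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",   # public pre-trained model (assumed choice)
    savedir="pretrained_ecapa",
)

def embed(wav_path):
    # Load a 16 kHz mono utterance and return its speaker embedding.
    signal, fs = torchaudio.load(wav_path)
    emb = encoder.encode_batch(signal)            # shape: (1, 1, emb_dim)
    return emb.squeeze(0).squeeze(0)

# Hypothetical file names for a single same-speaker comparison trial.
emb_a = embed("child_utterance_a.wav")
emb_b = embed("child_utterance_b.wav")

# Cosine similarity serves as the raw comparison score.
score = F.cosine_similarity(emb_a, emb_b, dim=0).item()
print(f"cosine score: {score:.3f}")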
Acknowledgement
The work was funded by project no. FK128615, implemented with support provided by the National Research, Development and Innovation Fund of Hungary under the FK_18 funding scheme, and by the Stipendium Hungaricum Programme.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Abed, M.H., Sztahó, D. (2024). Deep Speaker Embeddings for Speaker Verification of Children. In: Nöth, E., Horák, A., Sojka, P. (eds) Text, Speech, and Dialogue. TSD 2024. Lecture Notes in Computer Science, vol. 15049. Springer, Cham. https://doi.org/10.1007/978-3-031-70566-3_6
Print ISBN: 978-3-031-70565-6
Online ISBN: 978-3-031-70566-3