Abstract
The task of developing an automatic speaker verification (ASV) system for children’s speech is challenging for a number of reasons, the dearth of domain-specific data being one of them. The challenge intensifies further with the introduction of short utterances, a relatively unexplored domain in children’s ASV. To circumvent the issue of data scarcity, the work in this paper extensively explores various in-domain and out-of-domain data augmentation techniques and proposes an approach that encompasses both. The out-of-domain data come from adult speakers, whose acoustic attributes are known to contrast starkly with those of child speakers. Consequently, techniques such as prosody modification, formant modification and voice conversion are employed to modify the adult acoustic features and render them acoustically similar to children’s speech prior to augmentation. The in-domain data augmentation, on the other hand, involves speed perturbation of children’s speech. The proposed data augmentation approach helps not only in increasing the amount of training data but also in effectively capturing the missing target attributes, which boosts verification performance: a relative improvement of \(43.91\%\) in equal error rate (EER) with respect to the baseline system is a testimony of it. Furthermore, the commonly used Mel-frequency cepstral coefficients (MFCC) average out the higher-frequency components due to the larger bandwidth of the filter-bank at high frequencies. Effective preservation of higher-frequency content in children’s speech is therefore another challenge which must be appropriately tackled for the development of a reliable and robust children’s ASV system.
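The in-domain speed-perturbation step mentioned above can be illustrated with a minimal sketch. This is not the authors’ implementation (which is not detailed in the abstract); it simply resamples a waveform by a given factor using linear interpolation, which changes both duration and pitch — the reason speed perturbation is a common in-domain augmentation.

```python
import numpy as np

def speed_perturb(signal: np.ndarray, factor: float) -> np.ndarray:
    """Resample a waveform so it plays `factor` times faster.

    A factor of 1.1 shortens the utterance and slightly raises pitch
    and formants; factors like 0.9, 1.0, 1.1 are typical choices.
    """
    n_out = int(round(len(signal) / factor))
    # Positions in the original signal at which to sample the output.
    positions = np.linspace(0, len(signal) - 1, n_out)
    return np.interp(positions, np.arange(len(signal)), signal)

# A 1-second, 16 kHz sine tone perturbed at three common factors.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 220 * t)
variants = {f: speed_perturb(tone, f) for f in (0.9, 1.0, 1.1)}
for f, v in variants.items():
    print(f, len(v))
```

Each perturbed copy is treated as an additional training utterance from the same speaker, multiplying the effective amount of in-domain data.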
The concatenation of MFCC and inverse Mel-frequency cepstral coefficient (IMFCC) features is carried out with the sole intention of effectively preserving the higher-frequency content of children’s speech. When combined with the proposed data augmentation, the feature concatenation approach further improves verification performance, resulting in an overall relative reduction of \(48.51\%\) in equal error rate.
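The MFCC+IMFCC concatenation can be sketched as follows. This is an illustrative implementation under standard definitions, not the paper’s exact configuration: the inverted filterbank is obtained by mirroring the mel filterbank along the frequency axis, so its filters are densest at high frequencies, and the two cepstral streams are stacked frame-wise.

```python
import numpy as np
from scipy.fft import dct

def mel_filterbank(n_filters: int, n_fft: int, sr: int) -> np.ndarray:
    # Triangular filters spaced evenly on the mel scale.
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10 ** (m / 2595.0) - 1.0)
    pts = inv_mel(np.linspace(mel(0.0), mel(sr / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fbank

def mfcc_imfcc(power_spec: np.ndarray, sr: int,
               n_filters: int = 24, n_ceps: int = 13) -> np.ndarray:
    """Frame-wise concatenation of MFCC and IMFCC features.

    `power_spec` is a (frames, n_fft//2 + 1) power spectrogram.
    """
    n_fft = 2 * (power_spec.shape[1] - 1)
    fbank = mel_filterbank(n_filters, n_fft, sr)
    inv_fbank = fbank[::-1, ::-1]  # mirror in frequency: dense at high freqs
    feats = []
    for bank in (fbank, inv_fbank):
        log_energy = np.log(power_spec @ bank.T + 1e-10)
        feats.append(dct(log_energy, type=2, axis=1, norm="ortho")[:, :n_ceps])
    return np.concatenate(feats, axis=1)  # (frames, 2 * n_ceps)

# Random power spectrogram: 100 frames, 257 bins (i.e. n_fft = 512).
spec = np.abs(np.random.randn(100, 257)) ** 2
print(mfcc_imfcc(spec, sr=16000).shape)  # (100, 26)
```

Because the inverted filters allocate their resolution to the upper part of the spectrum, the IMFCC half of the vector retains exactly the high-frequency detail that the standard mel bank smooths away.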
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Aziz, S., Shahnawazuddin, S. (2023). Enhancing Children’s Short Utterance Based ASV Using Data Augmentation Techniques and Feature Concatenation Approach. In: Karpov, A., Samudravijaya, K., Deepak, K.T., Hegde, R.M., Agrawal, S.S., Prasanna, S.R.M. (eds) Speech and Computer. SPECOM 2023. Lecture Notes in Computer Science(), vol 14339. Springer, Cham. https://doi.org/10.1007/978-3-031-48312-7_31