Enhancing Children’s Short Utterance Based ASV Using Data Augmentation Techniques and Feature Concatenation Approach

  • Conference paper
Speech and Computer (SPECOM 2023)

Abstract

The task of developing an automatic speaker verification (ASV) system for children’s speech is a challenging one for a number of reasons, the dearth of domain-specific data being one of them. The challenge intensifies further with the introduction of short utterances of speech, a relatively unexplored domain in the case of children’s ASV. To circumvent the issue arising from data scarcity, the work in this paper extensively explores various in-domain and out-of-domain data augmentation techniques. A data augmentation approach is proposed that encompasses both in-domain and out-of-domain techniques. The out-of-domain data come from adult speakers, whose acoustic attributes are known to stand in stark contrast to those of child speakers. Consequently, techniques such as prosody modification, formant modification and voice conversion are employed to modify the adult acoustic features and render them acoustically similar to children’s speech prior to augmentation. The in-domain data augmentation approach, on the other hand, involves speed perturbation of children’s speech. The proposed data augmentation approach helps not only in increasing the amount of training data but also in effectively capturing the missing target attributes, which boosts the verification performance. A relative improvement of \(43.91\%\) in equal error rate (EER) with respect to the baseline system testifies to this. Furthermore, the commonly used Mel-frequency cepstral coefficients (MFCC) average out the higher-frequency components due to the larger bandwidth of the filter-bank at higher frequencies. Therefore, effective preservation of higher-frequency content in children’s speech is another challenge which must be appropriately tackled for the development of a reliable and robust children’s ASV system.
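To illustrate the in-domain augmentation idea, speed perturbation of a waveform can be approximated by simple resampling. The sketch below is this editor's illustration, not the authors' implementation; the function name, interpolation method, and perturbation factors (the 0.9/1.1 values common in Kaldi-style recipes) are assumptions.

```python
import numpy as np

def speed_perturb(signal: np.ndarray, factor: float) -> np.ndarray:
    """Time-scale a waveform by `factor` via linear-interpolation resampling.

    factor > 1.0 speeds the utterance up (shorter output, higher pitch);
    factor < 1.0 slows it down. Augmentation recipes commonly use
    factors such as 0.9 and 1.1 alongside the original speed.
    """
    n_out = int(round(len(signal) / factor))
    # Positions in the original signal at which to sample the output.
    positions = np.linspace(0, len(signal) - 1, num=n_out)
    return np.interp(positions, np.arange(len(signal)), signal)

# Example: a 1-second 220 Hz sine at 16 kHz, sped up by 10%.
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 220 * t)
y = speed_perturb(x, 1.1)   # shorter copy of the same utterance
```

Each perturbed copy is treated as an additional training utterance from the same speaker, multiplying the effective amount of in-domain data.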
The feature concatenation of MFCC and inverted Mel-frequency cepstral coefficients (IMFCC) is carried out with the sole intention of effectively preserving the higher-frequency content in the children’s speech data. The feature concatenation approach, when combined with the proposed data augmentation, further improves the verification performance and results in an overall relative reduction of \(48.51\%\) in equal error rate.
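The paper's exact front-end is not reproduced here, but the concatenation idea can be sketched in plain numpy. IMFCC is computed like MFCC except that the Mel filterbank is flipped along the frequency axis, so the narrow (high-resolution) filters sit at the high end of the spectrum; concatenating the two cepstra keeps both low- and high-frequency detail. All function names, filter counts, and frame parameters below are illustrative assumptions.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters with centres equally spaced on the Mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def dct_ii(x, n_coeffs):
    # Type-II DCT along the last axis (normalization omitted in this sketch).
    N = x.shape[-1]
    n = np.arange(N)
    basis = np.cos(np.pi * np.outer(np.arange(n_coeffs), (2 * n + 1) / (2 * N)))
    return x @ basis.T

def cepstral_features(frames_power, fb, n_coeffs=13):
    # Filterbank energies -> log -> DCT, the standard cepstral recipe.
    energies = np.maximum(frames_power @ fb.T, 1e-10)
    return dct_ii(np.log(energies), n_coeffs)

sr, n_fft = 16000, 512
fb_mel = mel_filterbank(26, n_fft, sr)
fb_inv = fb_mel[::-1, ::-1]            # inverted-Mel filterbank: flipped in frequency

# Dummy frames stand in for windowed speech; power spectrum via rFFT.
frames = np.random.default_rng(0).standard_normal((10, n_fft))
power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2

mfcc = cepstral_features(power, fb_mel)          # (10, 13)
imfcc = cepstral_features(power, fb_inv)         # (10, 13)
features = np.concatenate([mfcc, imfcc], axis=1) # 26-dim vector per frame
```

The concatenated 26-dimensional vectors would then feed the speaker-embedding extractor in place of plain MFCCs.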


Author information

Correspondence to Shahid Aziz.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Aziz, S., Shahnawazuddin, S. (2023). Enhancing Children’s Short Utterance Based ASV Using Data Augmentation Techniques and Feature Concatenation Approach. In: Karpov, A., Samudravijaya, K., Deepak, K.T., Hegde, R.M., Agrawal, S.S., Prasanna, S.R.M. (eds) Speech and Computer. SPECOM 2023. Lecture Notes in Computer Science, vol 14339. Springer, Cham. https://doi.org/10.1007/978-3-031-48312-7_31

  • DOI: https://doi.org/10.1007/978-3-031-48312-7_31

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-48311-0

  • Online ISBN: 978-3-031-48312-7

  • eBook Packages: Computer Science, Computer Science (R0)
