
Creating Robust Children’s ASR System in Zero-Resource Condition Through Out-of-Domain Data Augmentation

Published in: Circuits, Systems, and Signal Processing

Abstract

Developing an automatic speech recognition (ASR) system for children’s speech is extremely challenging because transcribed child-domain data are unavailable for the majority of languages. Consequently, in such zero-resource scenarios, an ASR system trained on adults’ speech must be used to transcribe data from child speakers. However, the acoustic mismatch arising from differences in formant frequencies and speaking rate between the two groups of speakers results in poor recognition rates, as reported in earlier works. To reduce this mismatch, an out-of-domain data augmentation approach based on formant and time-scale modification is proposed in this work. For that purpose, the formant frequencies of adults’ speech data are up-scaled by warping the linear predictive coding (LPC) coefficients. Next, the speaking rate of adults’ speech data is decreased through time-scale modification. Simultaneously altering the formant frequencies and duration of adults’ speech and then pooling the modified data into training reduces the ill effects of the acoustic mismatch caused by the aforementioned factors. This, in turn, enhances the recognition performance significantly. Additional improvement in recognition rate is obtained by combining the recently reported voice-conversion-based data augmentation technique with the proposed approach. As demonstrated by the experimental evaluations presented in this paper, a relative reduction of \(37.6\%\) in word error rate is achieved through data augmentation compared to an ASR system trained only on adults’ speech. Furthermore, the proposed approach yields large reductions in word error rate even under noisy test conditions.
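The abstract describes the two augmentation steps only at a high level. As a rough illustration, the sketch below shows one way such modifications could be realized in Python with librosa and SciPy: frame-wise LPC analysis, up-scaling of the pole angles (and hence the formant frequencies) by a warping factor, and then slowing of the speaking rate via time-scale modification. This is not the authors' exact pipeline; the warping factor, stretch rate, filter order, frame size, and file name are illustrative assumptions.

```python
# Minimal sketch (assumed parameters, not the paper's exact implementation):
# raise the formants of adult speech by warping LPC pole angles, then slow
# the speaking rate with time-scale modification.
import numpy as np
import librosa
import scipy.signal as sig

def warp_formants(y, sr, alpha=1.15, order=16, frame=1024, hop=512):
    """Scale formant frequencies by 'alpha' via frame-wise LPC pole-angle warping."""
    out = np.zeros(len(y))
    norm = np.zeros(len(y))
    win = np.hanning(frame)
    for start in range(0, len(y) - frame, hop):
        x = y[start:start + frame] * win
        a = librosa.lpc(x, order=order)            # LPC coefficients [1, a1, ..., ap]
        residual = sig.lfilter(a, [1.0], x)        # inverse filtering -> excitation signal
        poles = np.roots(a)
        # Increase each pole's angle (formant frequency) by 'alpha' while keeping
        # its radius (bandwidth) unchanged; clip angles at the Nyquist limit.
        warped = np.abs(poles) * np.exp(1j * np.clip(np.angle(poles) * alpha, -np.pi, np.pi))
        a_warped = np.real(np.poly(warped))
        out[start:start + frame] += sig.lfilter([1.0], a_warped, residual)
        norm[start:start + frame] += win
    return out / np.maximum(norm, 1e-8)            # overlap-add normalization

# Hypothetical usage on one adult utterance:
y, sr = librosa.load("adult_utterance.wav", sr=16000)
y_warp = warp_formants(y, sr, alpha=1.15)               # raise formants by ~15%
y_aug = librosa.effects.time_stretch(y_warp, rate=0.85) # slow speaking rate (rate < 1)
```

As described in the abstract, such modified copies of the adult training data are then pooled with the original adult data before acoustic-model training, which is what reduces the formant and speaking-rate mismatch with child test speech.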


Data Availability

The data that support the findings of this study are available from the Linguistic Data Consortium and have been duly cited in this work.


Author information


Corresponding author

Correspondence to Vinit Kumar.


About this article

Cite this article

Kumar, V., Kumar, A. & Shahnawazuddin, S. Creating Robust Children’s ASR System in Zero-Resource Condition Through Out-of-Domain Data Augmentation. Circuits Syst Signal Process 41, 2205–2220 (2022). https://doi.org/10.1007/s00034-021-01885-5
