Abstract
Developing an automatic speech recognition (ASR) system for children’s speech is extremely challenging because child-domain data are unavailable for most languages. In such zero-resource scenarios, an ASR system trained on adults’ speech must be used to transcribe data from child speakers. However, the acoustic mismatch between the two groups of speakers, caused by differences in formant frequencies and speaking rate, results in poor recognition rates, as reported in earlier works. To reduce this mismatch, an out-of-domain data augmentation approach based on formant and time-scale modification is proposed in this work. First, the formant frequencies of adults’ speech data are up-scaled by warping the linear predictive coding coefficients. Next, the speaking rate of adults’ speech data is decreased through time-scale modification. Altering both the formant frequencies and the duration of adults’ speech, and then pooling the modified data into training, reduces the ill effects of the acoustic mismatch caused by these two factors. This, in turn, enhances the recognition performance significantly. Additional improvement in recognition rate is obtained by combining the recently reported voice-conversion-based data augmentation technique with the proposed approach. As demonstrated by the experimental evaluations presented in this paper, data augmentation achieves a relative reduction of \(37.6\%\) in word error rate compared to an ASR system trained on adult data alone. Furthermore, the proposed approach yields large reductions in word error rates even under noisy test conditions.
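The formant up-scaling step can be illustrated with a toy example. The sketch below is not the warping procedure used in the paper; it assumes a simpler stand-in in which the pole angles of an all-pole (LPC-style) polynomial are multiplied by a factor `alpha` while the pole radii are kept. The helper names (`resonator`, `scale_formants`, `pole_frequencies`), the 16 kHz sampling rate, and the factor 1.2 are all hypothetical choices made for illustration.

```python
import numpy as np


def resonator(freq_hz, fs, radius=0.95):
    """All-pole (LPC-style) polynomial [1, a1, a2] with one resonance at freq_hz."""
    theta = 2.0 * np.pi * freq_hz / fs
    return np.array([1.0, -2.0 * radius * np.cos(theta), radius ** 2])


def scale_formants(a, alpha):
    """Scale the pole angles of the LPC polynomial `a` by `alpha` (alpha > 1 raises formants)."""
    a = np.asarray(a, dtype=float)
    poles = np.roots(a)
    # Keep each pole's radius (bandwidth) and stretch only its angle (frequency).
    angles = np.clip(np.angle(poles) * alpha, -np.pi, np.pi)
    new_poles = np.abs(poles) * np.exp(1j * angles)
    # Conjugate pairs stay conjugate, so the rebuilt polynomial is real.
    return a[0] * np.real(np.poly(new_poles))


def pole_frequencies(a, fs):
    """Return the resonance frequencies (Hz) of the poles in the upper half-plane."""
    poles = np.roots(a)
    poles = poles[np.imag(poles) > 0]
    return np.sort(np.angle(poles) * fs / (2.0 * np.pi))


fs = 16000  # assumed sampling rate
adult = resonator(1000.0, fs)            # toy "adult" resonance at 1 kHz
child_like = scale_formants(adult, 1.2)  # hypothetical 20% up-scaling
print(pole_frequencies(adult, fs), pole_frequencies(child_like, fs))
```

In a full pipeline, the modified coefficients would be used to resynthesize each frame from its LPC residual, and the slower speaking rate would be handled separately by a time-scale modification step; both are omitted here.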
Data Availability
The data that support the findings of this study are available from the Linguistic Data Consortium and have been duly cited in this work.
References
A. Batliner, M. Blomberg, S. D’Arcy, D. Elenius, D. Giuliani, M. Gerosa, C. Hacker, M. Russell, M. Wong, The PF_STAR children’s speech corpus. In: Proc. INTERSPEECH, pp. 2761–2764 (2005)
E.P. Damskägg, V. Välimäki, Audio time stretching using fuzzy classification of spectral bins. Appl. Sci. 7(12) (2017)
S. Das, D. Nix, M. Picheny, Improvements in children’s speech recognition performance. In: Proc. ICASSP 1, 433–436 (1998)
S. Eguchi, I.J. Hirsh, Development of speech sounds in children. Acta Otolaryngol. Suppl. 257, 1–51 (1969)
M. Gerosa, D. Giuliani, S. Narayanan, A. Potamianos, A review of ASR technologies for children’s speech. In: Proc. Workshop on Child, Computer and Interaction, pp. 7:1–7:8 (2009)
M. Gerosa, D. Giuliani, F. Brugnara, Acoustic variability and automatic recognition of children’s speech. Speech Commun. 49(10–11), 847–860 (2007)
S. Ghai, Addressing Pitch Mismatch for Children’s Automatic Speech Recognition. Ph.D. Thesis, Department of EEE, Indian Institute of Technology Guwahati (India, 2011)
S. Ghai, R. Sinha, Exploring the effect of differences in the acoustic correlates of adults’ and children’s speech in the context of automatic speech recognition. EURASIP J. Audio Speech Music Process. 7, 1–7 (2010)
J. Huber, E. Stathopoulos, G. Curione, T. Ash, K. Johnson, Formants of children, women, and men: the effects of vocal intensity variation. J. Acoust. Soc. Am. 106, 1532–42 (1999)
T. Kaneko, H. Kameoka, Parallel-data-free voice conversion using cycle-consistent adversarial networks. arXiv preprint arXiv:1711.11293 (2017)
H. Kathania, M. Singh, T. Grósz, M. Kurimo, Data augmentation using prosody and false starts to recognize non-native children’s speech. arXiv preprint arXiv:2008.12914 (2020)
H. Kumar Kathania, S. Reddy Kadiri, P. Alku, M. Kurimo, Study of formant modification for children ASR. In: Proc. ICASSP, pp. 7429–7433 (2020)
S. Lee, A. Potamianos, S.S. Narayanan, Acoustics of children’s speech: developmental changes of temporal and spectral parameters. J. Acoust. Soc. Am. 105(3), 1455–1468 (1999)
J. Makhoul, Linear prediction: a tutorial review. Proc. IEEE 63(4), 561–580 (1975)
V. Peddinti, D. Povey, S. Khudanpur, A time delay neural network architecture for efficient modeling of long temporal contexts. In: Proc. INTERSPEECH (2015)
A. Potamianos, S. Narayanan, Robust recognition of children speech. IEEE Trans. Speech and Audio Process. 11(6), 603–616 (2003)
D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, K. Vesely, The Kaldi Speech recognition toolkit. In: Proc. ASRU (2011)
T. Robinson, J. Fransen, D. Pye, J. Foote, S. Renals, WSJCAM0: a British English speech corpus for large vocabulary continuous speech recognition. Proc. ICASSP 1, 81–84 (1995)
M. Russell, S. D’Arcy, Challenges for computer recognition of children’s speech. In: Proc. Speech and Language Technologies in Education (SLaTE) (2007)
G.P. Scukanec, L. Petrosino, K. Squibb, Formant frequency characteristics of children, young adult, and aged female speakers. Percept. Mot. Skills 73(1), 203–208 (1991)
R. Serizel, D. Giuliani, Deep-neural network approaches for speech recognition with heterogeneous groups of speakers including children. Natural Language Engineering (2016)
S. Shahnawazuddin, N. Adiga, K. Kumar, A. Poddar, W. Ahmad, Voice conversion based data augmentation to improve children’s speech recognition in limited data scenario. In: Proc. INTERSPEECH, pp. 4382–4386 (2020)
S. Shahnawazuddin, A. Dey, R. Sinha, Pitch-adaptive front-end features for robust children’s ASR. In: Proc. INTERSPEECH (2016)
S. Shahnawazuddin, R. Sinha, Low-memory fast on-line adaptation for acoustically mismatched children’s speech recognition. In: Proc. INTERSPEECH (2015)
S. Shahnawazuddin, R. Sinha, G. Pradhan, Pitch-normalized acoustic features for robust children’s speech recognition. IEEE Signal Process. Lett. 24(8), 1128–1132 (2017)
S. Shahnawazuddin, N. Adiga, H.K. Kathania, Effect of prosody modification on children’s ASR. IEEE Signal Process. Lett. 24(11), 1749–1753 (2017)
P. Sheng, Z. Yang, Y. Qian, Gans for Children: A Generative Data Augmentation Strategy for Children Speech Recognition (2019), pp. 129–135
R. Sinha, S. Shahnawazuddin, Assessment of pitch-adaptive front-end signal processing for children’s speech recognition. Comput. Speech Lang. 48(Supplement C), 103–121 (2018)
A. Varga, H.J.M. Steeneken, Assessment for automatic speech recognition: II. NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems. Speech Commun. 12(3), 247–251 (1993)
A.H. Waibel, T. Hanazawa, G.E. Hinton, K. Shikano, K.J. Lang, Phoneme recognition using time-delay neural networks. IEEE Trans. Acoust. Speech Signal Process. 37, 328–339 (1989)
J. Wilpon, C. Jacobsen, A study of speech recognition for children and the elderly. Proc. ICASSP 1, 349–352 (1996)
I.C. Yadav, S. Shahnawazuddin, D. Govind, G. Pradhan, Spectral smoothing by variational mode decomposition and its effect on noise and pitch robustness of ASR system. In: Proc. ICASSP (2018)
G. Yeung, R. Fan, A. Alwan, Fundamental frequency feature normalization and data augmentation for child speech recognition. pp. 6993–6997 (2021)
F. Yu, Z. Yao, X. Wang, K. An, L. Xie, Z. Ou, B. Liu, X. Li, G. Miao, The SLT 2021 children speech recognition challenge: Open datasets, rules and baselines (2021)
Cite this article
Kumar, V., Kumar, A. & Shahnawazuddin, S. Creating Robust Children’s ASR System in Zero-Resource Condition Through Out-of-Domain Data Augmentation. Circuits Syst Signal Process 41, 2205–2220 (2022). https://doi.org/10.1007/s00034-021-01885-5
DOI: https://doi.org/10.1007/s00034-021-01885-5