A Fast Adaptation Approach for Enhanced Automatic Recognition of Children’s Speech with Mismatched Acoustic Models

Published in: Circuits, Systems, and Signal Processing

Abstract

This study explores issues in automatic speech recognition (ASR) of children’s speech using acoustic models trained on adults’ speech. For acoustic modeling in ASR, the employed front-end features capture the characteristics of the vocal tract filter while smoothing out those of the source (excitation). Adults’ and children’s speech differ significantly due to large deviations in acoustic correlates such as pitch, formants, and speaking rate. In the context of children’s speech recognition on mismatched acoustic models, the recognition rates remain highly degraded despite the use of vocal tract length normalization (VTLN) to address the formant mismatch. For the commonly used mel-filterbank-based cepstral features, earlier studies have shown that the acoustic mismatch is exacerbated by insufficient smoothing of the pitch harmonics of child speakers. To address this problem, a structured low-rank projection of the test features, as well as of the mean and covariance parameters of the acoustic models, was explored in an earlier work. In this paper, a low-latency adaptation scheme is presented for children’s mismatched ASR. The presented fast adaptation approach exploits the earlier reported low-rank projection technique in order to reduce the computational cost. In the proposed approach, development data from the children’s domain is partitioned into separate groups on the basis of their estimated VTLN warp factors. A set of adapted acoustic models is then created by combining the low-rank projection with model space adaptation for each of the warp factors. Given a child’s test utterance, an appropriate pre-adapted model mean supervector is first chosen based on the utterance’s estimated warp factor. The chosen supervector is then optimally scaled. Consequently, only two parameters need to be estimated: a warp factor and a model mean scaling factor. Even with such stringent constraints, the proposed adaptation technique yields a relative improvement of about 44% over the VTLN-included baseline.
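As a rough illustration of the two-parameter test-time adaptation described above, the following sketch picks the pre-adapted model mean supervector for the nearest VTLN warp-factor group and then optimally scales it. The warp-factor groups, the supervector dimensionality, and the least-squares scaling estimator shown here are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical bank of pre-adapted model mean supervectors, one per
# VTLN warp-factor group (built offline from children's development data).
pre_adapted = {w: rng.standard_normal(1000) for w in (0.80, 0.88, 0.96, 1.04)}

def select_supervector(warp_factor, bank):
    """Choose the pre-adapted supervector whose warp-factor group is nearest."""
    key = min(bank, key=lambda w: abs(w - warp_factor))
    return key, bank[key]

def estimate_scale(s, target):
    """Least-squares scaling of supervector s toward the target statistics:
    alpha = argmin_a ||target - a * s||^2 = <target, s> / <s, s>."""
    return float(target @ s) / float(s @ s)

# Usage: a child's utterance with estimated warp factor 0.85 falls in the
# 0.88 group; a noisy toy "target" stands in for its adaptation statistics.
key, s = select_supervector(0.85, pre_adapted)
target = 1.3 * s + 0.01 * rng.standard_normal(s.shape)
alpha = estimate_scale(s, target)
adapted_means = alpha * s  # only two estimated parameters: warp factor, alpha
```

Because only a nearest-group lookup and one inner-product ratio are computed at test time, the per-utterance adaptation cost stays low, which matches the low-latency goal stated in the abstract.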

Notes

  1. An initial version of this work was presented at Interspeech 2015 [34].

References

  1. O. Abdel-Hamid, H. Jiang, Rapid and Effective Speaker Adaptation of Convolutional Neural Network Based Models for Speech Recognition. In: Proceedings of INTERSPEECH, pp. 1248–1252 (2013)

  2. A. Batliner, M. Blomberg, S. D’Arcy, D. Elenius, D. Giuliani, M. Gerosa, C. Hacker, M. Russell, M. Wong, The PF_STAR Children’s Speech Corpus. In: Proceedings of INTERSPEECH, pp. 2761–2764 (2005)

  3. L. Bell, J. Gustafson, Children’s Convergence in Referring Expressions to Graphical Objects in a Speech-Enabled Computer Game. In: Proceedings of INTERSPEECH, pp. 2209–2212 (2007)

  4. D. Burnett, M. Fanty, Rapid unsupervised adaptation to children’s speech on a connected-digit task. Proc. ICSLP 2, 1145–1148 (1996)

  5. S. Davis, P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. 28(4), 357–366 (1980). doi:10.1109/TASSP.1980.1163420

  6. V. Digalakis, D. Rtischev, L. Neumeyer, Speaker adaptation using constrained estimation of Gaussian mixtures. IEEE Trans. Speech Audio Process. 3, 357–366 (1995)

  7. S. Evans, N. Neave, D. Wakelin, Relationships between vocal characteristics and body size and shape in human males: an evolutionary explanation for a deep male voice. Biol. Psychol. 72(2), 160–163 (2006)

  8. J. Fainberg, Improving Children’s Speech Recognition Through Out of Domain Data Augmentation. Master’s thesis, School of Informatics University of Edinburgh (2015)

  9. W.M. Fisher, G.R. Doddington, K.M. Goudie-Marshall, The DARPA Speech Recognition Research Database: Specifications and Status. In: Proceedings of DARPA Workshop on Speech Recognition, pp. 93–99 (1986)

  10. M.J.F. Gales, Cluster adaptive training of hidden Markov models. IEEE Trans. Speech Audio Process. 8(4), 417–428 (2000)

  11. J.L. Gauvain, C.H. Lee, Maximum a-posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Trans. Speech Audio Process. 2, 291–298 (1994)

  12. S. Ghai, Addressing Pitch Mismatch for Children’s Automatic Speech Recognition. Ph.D. thesis, Department of EEE, Indian Institute of Technology Guwahati, India (2011)

  13. S. Ghai, R. Sinha, Exploring the Role of Spectral Smoothing in Context of Children’s Speech Recognition. In: Proceedings of INTERSPEECH, pp. 1607–1610 (2009)

  14. S. Ghai, R. Sinha, Exploring the effect of differences in the acoustic correlates of adults’ and children’s speech in the context of automatic speech recognition. EURASIP J. Audio Speech Music Process. 2010, 7:1–7:15 (2010)

  15. J. González, Formant frequencies and body size of speaker: a weak relationship in adult humans. J. Phon. 32(2), 277–287 (2004)

  16. S.S. Gray, D. Willett, J. Pinto, J. Lu, P. Maergner, N. Bodenstab, Child Automatic Speech Recognition for US English: Child Interaction with Living-Room-Electronic-Devices. In: Proceedings of INTERSPEECH, Workshop on Child, Computer and Interaction (2014)

  17. J. Gustafson, K. Sjolander, Voice Transformations for Improving Children’s Speech Recognition in a Publicly Available Dialogue System. In: Proceedings of ICSLP, pp. 297–300 (2002)

  18. A. Hagen, B. Pellom, R. Cole, Children’s Speech Recognition with Application to Interactive Books and Tutors. In: Proceedings of ASRU Workshop, pp. 186–191 (2003)

  19. A. Hagen, B. Pellom, R. Cole, Highly accurate children’s speech recognition for interactive reading tutors using subword units. Speech Commun. 49(12), 861–873 (2007)

  20. T.J. Hazen, J.R. Glass, A Comparison of Novel Techniques for Instantaneous Speaker Adaptation. In: Proceedings of European Conference on Speech Communication and Technology, pp. 2047–2050 (1997)

  21. I.T. Jolliffe, Principal Component Analysis (Springer, Berlin, 1986)

  22. H.K. Kathania, S. Shahnawazuddin, R. Sinha, Exploring HLDA Based Transformation for Reducing Acoustic Mismatch in Context of Children Speech Recognition. In: Proceedings of International Conference on Signal Processing and Communications (SPCOM), pp. 1–5 (2014)

  23. R. Kuhn, J.C. Junqua, P. Nguyen, N. Niedzielski, Rapid speaker adaptation in eigenvoice space. IEEE Trans. Speech Audio Process. 8(6), 695–707 (2000)

  24. N. Kumar, A.G. Andreou, Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition. Speech Commun. 26(4), 283–297 (1998)

  25. L. Lee, R. Rose, A frequency warping approach to speaker normalization. IEEE Trans. Speech Audio Process. 6(1), 49–60 (1998)

  26. C.J. Leggetter, P.C. Woodland, Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Comput. Speech Lang. 9, 171–185 (1995)

  27. H. Liao, Speaker Adaptation of Context Dependent Deep Neural Networks. In: Proceedings of ICASSP, pp. 7947–7951 (2013)

  28. R. Nisimura, A. Lee, H. Saruwatari, K. Shikano, Public speech-oriented guidance system with adult and child discrimination capability. Proc. ICASSP 1, 433–436 (2004)

  29. A. Potamianos, S. Narayanan, Robust recognition of children’s speech. IEEE Trans. Speech Audio Process. 11(6), 603–616 (2003)

  30. T. Robinson, J. Fransen, D. Pye, J. Foote, S. Renals, WSJCAM0: A British English Speech Corpus for Large Vocabulary Continuous Speech Recognition. In: Proceedings of ICASSP, pp. 81–85 (1995)

  31. J. Schalkwyk, D. Beeferman, F. Beaufays, B. Byrne, C. Chelba, M. Cohen, M. Kamvar, B. Strope, Your word is my command: Google search by voice: a case study, in Advances in Speech Recognition: Mobile Environments, Call Centers and Clinics, chap. 4, pp. 61–90 (2010)

  32. S. Shahnawazuddin, H. Kathania, R. Sinha, Enhancing the Recognition of Children’s Speech on Acoustically Mismatched ASR System. In: Proceedings of IEEE TENCON (2015)

  33. S. Shahnawazuddin, R. Sinha, Improved bases selection in acoustic model interpolation for fast on-line adaptation. IEEE Signal Process. Lett. 21(4), 493–497 (2014)

  34. S. Shahnawazuddin, R. Sinha, Low-Memory Fast On-Line Adaptation for Acoustically Mismatched Children’s Speech Recognition. In: Proceedings of INTERSPEECH, pp. 1630–1634 (2015)

  35. X. Shao, B. Milner, Pitch Prediction from MFCC Vectors for Speech Reconstruction. In: Proceedings of ICASSP, pp. 97–100 (2004)

  36. H. Singer, S. Sagayama, Pitch Dependent Phone Modelling for HMM Based Speech Recognition. In: Proceedings of ICASSP, pp. 273–276 (1992)

  37. R. Sinha, S. Ghai, On the Use of Pitch Normalization for Improving Children’s Speech Recognition. In: Proceedings of INTERSPEECH, pp. 568–571 (2009)

  38. T. Tan, Y. Qian, K. Yu, Cluster adaptive training for deep neural network based acoustic model. IEEE/ACM Trans. Audio Speech Lang. Process. 24(3), 459–468 (2016)

  39. P.C. Woodland, Speaker Adaptation for Continuous Density HMMs: A Review. In: Proceedings of ISCA Tutorial and Research Workshop on Adaptation Methods for Speech Recognition, pp. 11–19 (2001)

  40. S. Xue, O. Abdel-Hamid, H. Jiang, L. Dai, Q. Liu, Fast adaptation of deep neural network based on discriminant codes for speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 22(12), 1713–1725 (2014)

  41. K. Yao, D. Yu, F. Seide, H. Su, L. Deng, Y. Gong, Adaptation of Context-Dependent Deep Neural Networks for Automatic Speech Recognition. In: Proceedings of SLT, pp. 366–369 (2012)

  42. S. Young, G. Evermann, M.J.F. Gales, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, P. Woodland, The HTK Book (version 3.4). Cambridge University Engineering Department, Cambridge, UK (2006)

Corresponding author

Correspondence to S. Shahnawazuddin.

About this article

Cite this article

Shahnawazuddin, S., Sinha, R. A Fast Adaptation Approach for Enhanced Automatic Recognition of Children’s Speech with Mismatched Acoustic Models. Circuits Syst Signal Process 37, 1098–1115 (2018). https://doi.org/10.1007/s00034-017-0586-6
