An Experimental Study on the Significance of Variable Frame-Length and Overlap in the Context of Children’s Speech Recognition

Circuits, Systems, and Signal Processing

Abstract

It is well known that the recognition performance of an automatic speech recognition (ASR) system is affected by intra-speaker as well as inter-speaker variability. Differences among speakers in the geometry of the vocal organs, pitch and speaking-rate are some of the inter-speaker variabilities that affect recognition performance. A mismatch between the training and test data with respect to any of these factors leads to increased error rates. An example of acoustically mismatched ASR is the task of transcribing children’s speech on a system trained on adult data. A large number of earlier studies present a myriad of techniques for addressing the acoustic mismatch arising from differences in pitch and in the dimensions of the vocal organs. At the same time, only a few works on speaking-rate adaptation employing timescale modification have been reported. Furthermore, those studies were performed on ASR systems developed using Gaussian mixture models. Motivated by these facts, speaking-rate adaptation is explored in this work in the context of a children’s ASR system employing deep neural network-based acoustic modeling. Speaking-rate adaptation is performed by changing the frame-length and overlap during the front-end feature extraction process. Significant reductions in errors are obtained with speaking-rate adaptation. In addition, we have studied the effect of combining speaking-rate adaptation with vocal-tract length normalization and explicit pitch modification. In both cases, additive improvements are obtained. To summarize, relative improvements of 15–20% over the baselines are obtained by varying the frame-length and frame-overlap.
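
As a rough illustration of what changing the frame-length and overlap means at the front end, the following minimal sketch splits a waveform into overlapping, Hamming-windowed frames. It is not the authors' implementation: the function name `frame_signal`, the parameter names, and the values used (25 ms and 15 ms frames, 60% overlap, 16 kHz sampling) are assumptions chosen only for the example, and the abstract does not state which settings were used.

```python
import numpy as np


def frame_signal(signal, sample_rate, frame_ms=25.0, overlap=0.6):
    """Split a 1-D signal into overlapping Hamming-windowed frames.

    frame_ms and overlap are illustrative knobs (hypothetical defaults):
    overlap is the fraction of each frame shared with the next one.
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    # Frame shift (hop) follows from the frame length and the overlap.
    shift = max(1, int(round(frame_len * (1.0 - overlap))))
    if len(signal) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(signal) - frame_len) // shift
    # Row i holds samples [i * shift, i * shift + frame_len).
    idx = np.arange(frame_len)[None, :] + shift * np.arange(n_frames)[:, None]
    return signal[idx] * np.hamming(frame_len)


# One second of synthetic audio at an assumed 16 kHz sampling rate.
rng = np.random.default_rng(0)
audio = rng.standard_normal(16000)

baseline = frame_signal(audio, 16000, frame_ms=25.0, overlap=0.6)
shorter = frame_signal(audio, 16000, frame_ms=15.0, overlap=0.6)
print(baseline.shape)  # (98, 400)
print(shorter.shape)   # (165, 240)
```

With the overlap fraction held fixed, a shorter frame also shortens the shift, so the same utterance yields more feature vectors per second. That is one way a front end can change the effective time resolution seen by the acoustic model without modifying the audio itself.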

Acknowledgements

The authors express sincere gratitude to the anonymous reviewers for their thoughtful comments and suggestions.

Author information

Corresponding author

Correspondence to S. Shahnawazuddin.

About this article

Cite this article

Shahnawazuddin, S., Singh, C., Kathania, H.K. et al. An Experimental Study on the Significance of Variable Frame-Length and Overlap in the Context of Children’s Speech Recognition. Circuits Syst Signal Process 37, 5540–5553 (2018). https://doi.org/10.1007/s00034-018-0828-2

  • DOI: https://doi.org/10.1007/s00034-018-0828-2
