Explicit Pitch Mapping for Improved Children’s Speech Recognition

Published in: Circuits, Systems, and Signal Processing

Abstract

Recognizing children’s speech with automatic speech recognition (ASR) systems developed using adults’ speech is a very challenging task. As several earlier works have reported, recognition performance degrades severely in such ASR tasks, mainly because of the gross mismatch in acoustic and linguistic attributes between the two groups of speakers. One identified source of mismatch is that the vocal organs of adult and child speakers differ significantly in their dimensions. Feature-space normalization techniques are known to effectively address the ill-effects arising from these differences; the two most commonly used approaches are vocal-tract length normalization and feature-space maximum-likelihood linear regression. Another important mismatch factor is the large variation in average pitch between adult and child speakers. Addressing the ill-effects introduced by these pitch differences is the primary focus of the presented study. In this regard, we explore the feasibility of explicitly changing the pitch of children’s speech so that the observed pitch difference between the two groups of speakers is reduced. In general, children’s speech is higher-pitched than adults’. Consequently, in this study, the pitch of the adults’ speech used for training the ASR system is kept unchanged, while that of the children’s test speech data is reduced. This explicit pitch reduction yields a significant improvement in recognition performance. To preserve the critical spectral information and to avoid introducing perceptual artifacts, we exploit time-scale modification techniques for the explicit pitch mapping. Furthermore, we present two schemes to automatically determine the factor by which the pitch of given test data should be varied. Automatically determining the compensation factor is critical, since an ASR system is expected to be accessed by both adult and child speakers. The effectiveness of the proposed techniques is evaluated on adult-data-trained ASR systems employing different acoustic modeling approaches, viz. Gaussian mixture models (GMM), subspace GMMs and deep neural networks (DNN). The proposed techniques are found to be highly effective in all the explored modeling paradigms. To study the proposed approaches further, another DNN-based ASR system is developed on a mix of speech data from adult as well as child speakers. Pitch reduction is observed to be effective even in this case.
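The abstract describes lowering the pitch of test speech via time-scale modification followed by resampling, which changes pitch while preserving duration. The following is an illustrative numpy sketch of that general pipeline, using a simple WSOLA-style time stretch; it is not the authors' exact implementation, and all function names and parameter values here are our own assumptions:

```python
import numpy as np

def time_stretch_wsola(x, alpha, frame=1024, hop=256, search=256):
    """WSOLA-style time-scale modification: output duration is roughly
    alpha * input duration while the pitch is left unchanged."""
    win = np.hanning(frame)
    n_out = int(len(x) * alpha)
    y = np.zeros(n_out + frame)
    w = np.zeros(n_out + frame)
    delta = 0  # offset chosen by the waveform-similarity search
    for out_pos in range(0, n_out, hop):
        a_pos = min(max(int(out_pos / alpha) + delta, 0), len(x) - frame)
        seg = x[a_pos:a_pos + frame]
        y[out_pos:out_pos + frame] += seg * win
        w[out_pos:out_pos + frame] += win
        # "natural continuation": the samples that would follow seg
        # if playback continued at the original rate
        target = x[a_pos + hop:a_pos + hop + frame]
        next_a = int((out_pos + hop) / alpha)
        lo = max(next_a - search, 0)
        hi = min(next_a + search, len(x) - frame)
        if hi <= lo or len(target) < frame:
            delta = 0
            continue
        # pick the candidate frame best matching the natural continuation
        corr = np.correlate(x[lo:hi + frame], target, mode='valid')
        delta = lo + int(np.argmax(corr)) - next_a
    w[w < 1e-8] = 1.0  # avoid division by zero at the edges
    return y[:n_out] / w[:n_out]

def pitch_shift(x, factor):
    """Scale pitch by `factor` (< 1 lowers it) while preserving duration:
    time-stretch by `factor`, then resample by reading at step `factor`."""
    stretched = time_stretch_wsola(x, factor)
    idx = np.arange(0, len(stretched) - 1, factor)
    return np.interp(idx, np.arange(len(stretched)), stretched)

# Demo: lower a 300 Hz tone to ~225 Hz (factor 0.75), duration preserved
sr = 16000
tone = np.sin(2 * np.pi * 300 * np.arange(sr) / sr)
shifted = pitch_shift(tone, 0.75)
```

Because the spectral envelope is only resampled, not warped frame by frame, this kind of explicit mapping avoids the perceptual artifacts the abstract mentions; a factor of 0.75 here stands in for the compensation factor that the paper's proposed schemes would estimate automatically.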


Figs. 1–9 (figures not included in this preview)

Notes

  1. Note that neither cepstral mean and variance normalization (CMVN) nor fMLLR is applied to the MFCC features used in the analyses reported here, even though both CMVN and fMLLR are applied prior to training the model parameters.
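CMVN itself is a simple per-utterance operation: each cepstral coefficient track is shifted to zero mean and scaled to unit variance, removing stationary channel and speaker offsets. A minimal numpy sketch, for illustration only (not the toolkit implementation used in the paper):

```python
import numpy as np

def cmvn(feats, eps=1e-8):
    """Utterance-level cepstral mean and variance normalization.

    feats: (num_frames, num_coeffs) matrix, e.g. MFCCs. Each coefficient
    dimension is normalized to zero mean and unit variance over the
    utterance; eps guards against division by zero for constant tracks."""
    mu = feats.mean(axis=0)
    sigma = feats.std(axis=0)
    return (feats - mu) / np.maximum(sigma, eps)

# Demo on random values standing in for a 13-dimensional MFCC stream
rng = np.random.default_rng(0)
mfcc = rng.normal(loc=5.0, scale=3.0, size=(200, 13))
normed = cmvn(mfcc)
```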


Acknowledgements

The authors would like to express sincere gratitude to the anonymous reviewers for their thoughtful comments and suggestions which greatly helped in improving the quality of the paper.

Corresponding author

Correspondence to Hemant Kumar Kathania.


About this article


Cite this article

Kathania, H.K., Ahmad, W., Shahnawazuddin, S. et al. Explicit Pitch Mapping for Improved Children’s Speech Recognition. Circuits Syst Signal Process 37, 2021–2044 (2018). https://doi.org/10.1007/s00034-017-0652-0

