Addressing Effects of Formant Dispersion and Pitch Sensitivity for the Development of Children’s KWS System

  • Conference paper
Speech and Computer (SPECOM 2023)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14338)

Abstract

The accuracy of an automatic keyword spotting (KWS) system degrades in the presence of mismatches between training and test conditions, such as differences in pitch, speaking rate, formant dispersion, and background noise. To mitigate these mismatches, this paper proposes a simple and efficient front-end speech parameterization technique. First, formant dispersion is suppressed by temporally averaging the short-term magnitude spectra (ST-MS) over adjacent frames. Next, the high-frequency oscillations due to pitch harmonics are smoothed out by a low-pass data-adaptive single-pole filter (DA-SPF) whose pole value adapts to each analysis frame, providing non-uniform spectral smoothing for voiced and non-voiced speech frames. The Mel-frequency cepstral coefficients (MFCC) extracted from the smoothed spectra are appended with five logarithmically compressed resonant peaks to construct the acoustic feature termed temporal averaged smoothed spectra (TASS)-MFCC-ARP. On a deep neural network-hidden Markov model (DNN-HMM) based KWS system, TASS-MFCC-ARP yields a relative improvement of \(104.07\%\) over the baseline MFCC under pitch-mismatched test conditions. Because the bandwidth of the filters used to compute MFCC has a direct impact on the pitch harmonics of the ST-MS, we then study the performance of the proposed feature for varying Mel-filterbank sizes. Decreasing the Mel-filterbank size yields a notable performance gain for the KWS system. Data-augmented training through prosody modification further improves robustness to pitch and speaking rate variations.
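
The two smoothing steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the frame length, hop, FFT size, context width, the forward-backward filtering pass, and all function names are assumptions chosen for illustration. The per-frame pole values are supplied by the caller, because the abstract does not state how the pole is adapted.

```python
import numpy as np

def short_term_magnitude_spectra(signal, frame_len=400, hop=160, n_fft=512):
    """Split the signal into Hamming-windowed frames and return |FFT| per frame.
    Frame length, hop, and FFT size are assumed values (25 ms / 10 ms at 16 kHz)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))

def temporal_average(spectra, context=1):
    """Average each frame's spectrum with `context` neighbours on either side.
    Edge frames are padded by repetition (an assumption)."""
    padded = np.pad(spectra, ((context, context), (0, 0)), mode="edge")
    width = 2 * context + 1
    return np.mean([padded[k:k + len(spectra)] for k in range(width)], axis=0)

def single_pole_smooth(spectrum, pole):
    """Low-pass one spectrum along the frequency axis with a single-pole filter,
    y[k] = (1 - pole) * x[k] + pole * y[k - 1].
    A backward pass is added so that peaks are not shifted (an assumption)."""
    out = np.asarray(spectrum, dtype=float).copy()
    for k in range(1, len(out)):            # forward pass
        out[k] = (1 - pole) * out[k] + pole * out[k - 1]
    for k in range(len(out) - 2, -1, -1):   # backward pass
        out[k] = (1 - pole) * out[k] + pole * out[k + 1]
    return out

def smoothed_spectra(signal, poles, context=1):
    """Temporal averaging followed by per-frame single-pole smoothing.
    `poles` holds one value in [0, 1) per frame; how to pick it is left
    to the caller because the abstract does not specify the rule."""
    spectra = temporal_average(short_term_magnitude_spectra(signal), context)
    return np.stack([single_pole_smooth(s, p) for s, p in zip(spectra, poles)])
```

A larger pole smooths more heavily, so a caller could, for example, pass a larger value for frames judged voiced and a smaller one otherwise. That mapping is a hypothetical usage and is not taken from the paper.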

Corresponding author

Correspondence to Jayant Kumar Rout.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Rout, J.K., Pradhan, G. (2023). Addressing Effects of Formant Dispersion and Pitch Sensitivity for the Development of Children’s KWS System. In: Karpov, A., Samudravijaya, K., Deepak, K.T., Hegde, R.M., Agrawal, S.S., Prasanna, S.R.M. (eds) Speech and Computer. SPECOM 2023. Lecture Notes in Computer Science, vol 14338. Springer, Cham. https://doi.org/10.1007/978-3-031-48309-7_42

  • DOI: https://doi.org/10.1007/978-3-031-48309-7_42

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-48308-0

  • Online ISBN: 978-3-031-48309-7

  • eBook Packages: Computer Science, Computer Science (R0)
