Addressing Effects of Formant Dispersion and Pitch Sensitivity for the Development of Children’s KWS System

Rout, Jayant Kumar; Pradhan, Gayadhar

doi:10.1007/978-3-031-48309-7_42

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14338)

Included in the following conference series:

International Conference on Speech and Computer

Abstract

The accuracy of an automatic keyword spotting (KWS) system degrades in the presence of mismatches such as pitch, speaking rate, formant dispersion, and background noise. To partially address these mismatches, this paper proposes a simple and efficient front-end speech parameterization technique. First, formant dispersion is suppressed by temporally averaging the short-term magnitude spectra (ST-MS) over adjacent frames. Next, the high-frequency oscillations due to pitch harmonics are smoothed out by a low-pass data-adaptive single-pole filter (DA-SPF), whose pole value adapts for each analysis frame, providing non-uniform spectral smoothing for voiced and non-voiced speech frames. The Mel-frequency cepstral coefficients (MFCC) extracted from the smoothed spectra are appended with five logarithmically compressed resonant peaks to construct the acoustic feature termed temporal averaged smoothed spectra (TASS)-MFCC-ARP. The TASS-MFCC-ARP feature yields a relative improvement of 104.07% over the baseline MFCC under pitch-mismatched test conditions on a deep neural network-hidden Markov model (DNN-HMM) based KWS system. Because the bandwidth of the filters used to compute MFCC directly affects the pitch harmonics of the ST-MS, we next study the performance of the proposed feature for varying Mel-filterbank sizes. Decreasing the Mel-filterbank size yields a notable performance gain for the KWS system. Data-augmented training through prosody modification further improves robustness to pitch and speaking-rate variations.
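The two smoothing steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the `context` window size, and the recursion y[k] = (1 - a) x[k] + a y[k-1] run along the frequency axis are all assumptions made for illustration. The abstract does not say how the pole is chosen per frame, so the per-frame pole values are simply passed in by the caller.

```python
from typing import List, Sequence

Spectrum = List[float]


def temporal_average(frames: Sequence[Sequence[float]], context: int = 1) -> List[Spectrum]:
    """Average each magnitude spectrum with up to `context` neighbouring frames per side.

    `context` is a hypothetical parameter; the window is truncated at the edges.
    """
    n = len(frames)
    averaged: List[Spectrum] = []
    for t in range(n):
        window = frames[max(0, t - context):min(n, t + context + 1)]
        # zip(*window) walks the frequency bins across the frames in the window.
        averaged.append([sum(bin_values) / len(window) for bin_values in zip(*window)])
    return averaged


def single_pole_smooth(spectrum: Sequence[float], pole: float) -> Spectrum:
    """First-order low-pass along frequency: y[k] = (1 - pole) * x[k] + pole * y[k-1].

    A larger pole gives heavier smoothing; pole = 0 leaves the spectrum unchanged.
    """
    if not 0.0 <= pole < 1.0:
        raise ValueError("pole must lie in [0, 1)")
    if not spectrum:
        return []
    smoothed: Spectrum = []
    prev = spectrum[0]  # initialise the filter state with the first bin
    for x in spectrum:
        prev = (1.0 - pole) * x + pole * prev
        smoothed.append(prev)
    return smoothed


def tass_spectra(frames: Sequence[Sequence[float]], poles: Sequence[float],
                 context: int = 1) -> List[Spectrum]:
    """Temporal averaging followed by per-frame single-pole smoothing.

    `poles` holds one caller-supplied pole per frame (assumed, not from the paper).
    """
    if len(frames) != len(poles):
        raise ValueError("one pole value is required per frame")
    averaged = temporal_average(frames, context)
    return [single_pole_smooth(spec, pole) for spec, pole in zip(averaged, poles)]
```

A caller might, for example, pass a larger pole for frames judged to be voiced and a smaller one otherwise. That would give the non-uniform smoothing the abstract describes, although the actual adaptation rule is not given here. A single forward pass also shifts spectral peaks slightly toward higher bins. A forward-backward pass would avoid this at the cost of heavier smoothing.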

Author information

Authors and Affiliations

Department of Electronics and Communication Engineering, NIT, Patna, Patna, India
Jayant Kumar Rout & Gayadhar Pradhan

Corresponding author

Correspondence to Jayant Kumar Rout.

Editor information

Editors and Affiliations

St. Petersburg Federal Research Center of the Russian Academy of Sciences, St. Petersburg, Russia
Alexey Karpov
Koneru Lakshmaiah Education Foundation, Vaddeswaram, India
K. Samudravijaya
Indian Institute of Information Technology Dharwad, Dharwad, India
K. T. Deepak
Indian Institute of Technology Dharwad, Dharwad, India
Rajesh M. Hegde
KIIT Group of Colleges, Gurugram, India
Shyam S. Agrawal
Indian Institute of Technology Dharwad, Dharwad, India
S. R. Mahadeva Prasanna

About this paper

Cite this paper

Rout, J.K., Pradhan, G. (2023). Addressing Effects of Formant Dispersion and Pitch Sensitivity for the Development of Children’s KWS System. In: Karpov, A., Samudravijaya, K., Deepak, K.T., Hegde, R.M., Agrawal, S.S., Prasanna, S.R.M. (eds) Speech and Computer. SPECOM 2023. Lecture Notes in Computer Science, vol 14338. Springer, Cham. https://doi.org/10.1007/978-3-031-48309-7_42

DOI: https://doi.org/10.1007/978-3-031-48309-7_42
Published: 22 November 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-48308-0
Online ISBN: 978-3-031-48309-7
eBook Packages: Computer Science, Computer Science (R0)
