Abstract
The accuracy of an automatic keyword spotting (KWS) system degrades in the presence of mismatches in pitch, speaking rate, formant dispersion, and background noise. To address these mismatches to some extent, this paper proposes a simple and efficient technique based on front-end speech parameterization. In the proposed approach, the formant dispersion is first suppressed by temporal averaging of the short-term magnitude spectra (ST-MS) over adjacent frames. Next, the high-frequency oscillations due to pitch harmonics are smoothed out by a low-pass data-adaptive single-pole filter (DA-SPF), whose pole value changes adaptively for each analysis frame. This filter provides non-uniform spectral smoothing for voiced and non-voiced speech frames. The Mel-frequency cepstral coefficients (MFCCs) extracted from the smoothed spectra are appended with five logarithmically compressed resonant peaks to construct the acoustic feature termed temporal averaged smoothed spectra (TASS)-MFCC-ARP. On a deep neural network-hidden Markov model (DNN-HMM) based KWS system, TASS-MFCC-ARP yields a relative improvement of \(104.07\%\) over the baseline MFCC under pitch-mismatched test conditions. Because the bandwidth of the filters used to compute the MFCCs directly affects how the pitch harmonics of the ST-MS are captured, we next study the performance of the proposed feature for varying Mel-filterbank sizes. Decreasing the Mel-filterbank size gives a notable performance gain for the KWS system. A further improvement under pitch and speaking-rate variations is achieved by data-augmented training through prosody modification.
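The two smoothing stages described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the `context` window size, and the choice to run the single-pole recursion along the frequency axis are assumptions made for illustration. The abstract does not state how the pole is chosen for each frame, so the sketch takes the per-frame pole values as an input (`poles`) rather than guessing a rule.

```python
import numpy as np


def temporal_average(mag_spec, context=1):
    """Average each frame's magnitude spectrum with its neighbours.

    mag_spec: array of shape (n_frames, n_bins).
    context:  number of frames taken on each side (assumed value, not from the paper).
    """
    n_frames = mag_spec.shape[0]
    out = np.empty_like(mag_spec, dtype=float)
    for t in range(n_frames):
        lo = max(0, t - context)
        hi = min(n_frames, t + context + 1)
        out[t] = mag_spec[lo:hi].mean(axis=0)
    return out


def single_pole_smooth(mag_spec, poles):
    """Low-pass each frame along frequency with a one-pole recursion.

    y[k] = (1 - a) * x[k] + a * y[k - 1], with one pole value a per frame.
    How a is derived from the data is left to the caller (assumption).
    """
    out = np.empty_like(mag_spec, dtype=float)
    for t, a in enumerate(poles):
        y = mag_spec[t, 0]  # initialise the state with the first bin
        for k, x in enumerate(mag_spec[t]):
            y = (1.0 - a) * x + a * y
            out[t, k] = y
    return out


# Toy usage: 4 frames, 8 frequency bins, hypothetical pole values per frame.
rng = np.random.default_rng(0)
spectra = np.abs(rng.standard_normal((4, 8)))
averaged = temporal_average(spectra, context=1)
smoothed = single_pole_smooth(averaged, poles=[0.6, 0.6, 0.2, 0.2])
```

A larger pole value smooths more strongly, so one plausible use is a larger pole for voiced frames, where pitch harmonics are prominent, and a smaller one elsewhere. That mapping is an assumption here, not a detail given in the abstract.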
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Rout, J.K., Pradhan, G. (2023). Addressing Effects of Formant Dispersion and Pitch Sensitivity for the Development of Children’s KWS System. In: Karpov, A., Samudravijaya, K., Deepak, K.T., Hegde, R.M., Agrawal, S.S., Prasanna, S.R.M. (eds) Speech and Computer. SPECOM 2023. Lecture Notes in Computer Science, vol 14338. Springer, Cham. https://doi.org/10.1007/978-3-031-48309-7_42
DOI: https://doi.org/10.1007/978-3-031-48309-7_42
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-48308-0
Online ISBN: 978-3-031-48309-7
eBook Packages: Computer Science (R0)