Abstract
This paper proposes a simple and effective data-adaptive smoothing approach to suppress pitch- and environment-induced mismatches in keyword spotting (KWS) systems. In the proposed method, the magnitude spectra are passed through a data-adaptive single-pole filter (DA-SPF) before the Mel frequency cepstral coefficients (MFCCs) are computed, which filters out the high-frequency components that are mainly due to pitch periodicity. The pole magnitude, which controls the degree of spectral smoothing, is adapted for each analysis frame according to the normalized spectral magnitude in the 0–2500 Hz frequency band. Because the formant magnitude of voiced sound units is predominant in this band, the magnitude spectra of pitch-sensitive voiced frames are smoothed more than those of non-voiced frames. When KWS systems are developed using MFCCs extracted from the DA-SPF smoothed spectra, referred to as single-pole smoothed (SPS)-MFCCs, KWS performance improves significantly in pitch- and noise-mismatched test conditions. On the DNN-HMM-based KWS system, the SPS-MFCCs yield a relative improvement of 86.12% over the MFCC baseline for pitch-mismatched test conditions.
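The idea of frame-wise adaptive smoothing can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the filter equation or the rule that maps the 0–2500 Hz normalized magnitude to a pole value, so the linear mapping, the pole range (`pole_min`, `pole_max`), the function names, and the frame and FFT sizes below are all assumptions chosen only for illustration.

```python
import numpy as np


def band_magnitude_ratio(mag, sample_rate, f_max=2500.0):
    """Fraction of a frame's total spectral magnitude lying in 0..f_max Hz."""
    freqs = np.linspace(0.0, sample_rate / 2.0, mag.shape[-1])
    in_band = freqs <= f_max
    return float(mag[in_band].sum() / (mag.sum() + 1e-12))


def adaptive_pole(ratio, pole_min=0.3, pole_max=0.9):
    """ASSUMED linear map from band ratio to pole magnitude (hypothetical)."""
    return pole_min + (pole_max - pole_min) * float(np.clip(ratio, 0.0, 1.0))


def single_pole_smooth(mag, pole):
    """Single-pole low-pass with unity DC gain, run along the frequency axis.

    y[k] = (1 - pole) * x[k] + pole * y[k - 1]; a larger pole smooths more.
    """
    out = np.empty(len(mag), dtype=float)
    state = float(mag[0])
    for k, value in enumerate(mag):
        state = (1.0 - pole) * float(value) + pole * state
        out[k] = state
    return out


def da_spf_frame(frame, sample_rate, n_fft=512):
    """Return (smoothed magnitude spectrum, pole) for one analysis frame."""
    windowed = frame * np.hamming(len(frame))
    mag = np.abs(np.fft.rfft(windowed, n_fft))
    pole = adaptive_pole(band_magnitude_ratio(mag, sample_rate))
    return single_pole_smooth(mag, pole), pole


if __name__ == "__main__":
    sr = 16000
    t = np.arange(400) / sr  # one 25 ms frame (assumed frame length)
    # Synthetic "voiced" frame: harmonics of a 200 Hz pitch below 2500 Hz.
    voiced = sum(np.sin(2 * np.pi * 200.0 * h * t) for h in range(1, 11))
    # Synthetic "non-voiced" frame: white noise.
    noise = np.random.default_rng(0).standard_normal(400)
    for name, frame in (("voiced", voiced), ("noise", noise)):
        _, pole = da_spf_frame(frame, sr)
        print(name, "pole =", round(pole, 3))
```

In this sketch, a frame whose magnitude is concentrated below 2500 Hz receives a pole closer to `pole_max` and is therefore smoothed more heavily, which mirrors the behavior the abstract describes for voiced frames. The smoothed spectrum would then feed the usual Mel filterbank and DCT stages in place of the raw magnitude spectrum.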
Data Availability
The two speech corpora used for the experimental evaluations in this paper, WSJCAM0 Cambridge Read News and the PF-STAR British English Children’s Speech Corpus, are available online.
Cite this article
Rout, J.K., Pradhan, G. Data-Adaptive Single-Pole Filtering of Magnitude Spectra for Robust Keyword Spotting. Circuits Syst Signal Process 41, 3023–3039 (2022). https://doi.org/10.1007/s00034-021-01923-2
DOI: https://doi.org/10.1007/s00034-021-01923-2