Data-Adaptive Single-Pole Filtering of Magnitude Spectra for Robust Keyword Spotting

  • Short Paper
  • Circuits, Systems, and Signal Processing

Abstract

This paper proposes a simple and effective data-adaptive smoothing approach to suppress pitch- and environment-induced mismatches in keyword spotting (KWS) systems. In the proposed method, the magnitude spectra are passed through a data-adaptive single-pole filter (DA-SPF) before the Mel frequency cepstral coefficients (MFCCs) are computed. The filter removes the high-frequency (rapidly varying) spectral components, which are mainly due to pitch periodicity. The pole magnitude, which controls the degree of spectral smoothing, is adapted for each analysis frame according to the normalized spectral magnitude in the 0–2500 Hz band. Because the formants of voiced sound units dominate this band, the magnitude spectra of pitch-sensitive voiced frames are smoothed more than those of non-voiced frames. KWS systems built on MFCCs extracted from the DA-SPF-smoothed spectra, referred to as single-pole smoothed MFCCs (SPS-MFCCs), show significantly improved performance under pitch- and noise-mismatched test conditions. Under pitch-mismatched test conditions, the SPS-MFCCs yield a relative improvement of 86.12% for the DNN-HMM-based KWS system over the MFCC baseline.
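The abstract describes DA-SPF only at a high level, so the following is a minimal sketch of one possible reading of it, not the authors' implementation. It rests on three assumptions. First, the single-pole filter runs across the frequency bins of one frame's one-sided magnitude spectrum as y[k] = (1 - α)·x[k] + α·y[k - 1], applied forward and then backward. Second, the "normalized spectral magnitude in 0–2500 Hz" is taken to be the fraction of the frame's total magnitude lying in that band. Third, the pole α grows linearly with that fraction between two hypothetical limits, `alpha_min` and `alpha_max`. The function names `band_ratio`, `adaptive_pole`, `single_pole_smooth`, and `da_spf` are likewise illustrative.

```python
def band_ratio(mag, fs, f_hi=2500.0):
    """Fraction of a frame's total magnitude lying in 0..f_hi Hz (a value in [0, 1]).

    `mag` is a one-sided magnitude spectrum with len(mag) = n_fft // 2 + 1 bins.
    ASSUMPTION: this normalization is illustrative; the abstract does not define it.
    """
    n_bins = len(mag)
    bin_hz = fs / (2.0 * (n_bins - 1))  # frequency spacing between bins
    total = sum(mag)
    if total <= 0.0:
        return 0.0  # silent frame: treated as non-voiced (least smoothing)
    low = sum(m for k, m in enumerate(mag) if k * bin_hz <= f_hi)
    return low / total


def adaptive_pole(mag, fs, alpha_min=0.5, alpha_max=0.9):
    """HYPOTHETICAL linear map from the low-band ratio to a pole magnitude."""
    return alpha_min + (alpha_max - alpha_min) * band_ratio(mag, fs)


def single_pole_smooth(mag, alpha):
    """Forward-backward single-pole smoothing across frequency bins.

    Each pass computes y[k] = (1 - alpha) * x[k] + alpha * y[k - 1], starting
    from y[first] = x[first]; the backward pass cancels the bin shift that the
    forward pass introduces.
    """
    out = list(mag)
    for order in (range(len(out)), range(len(out) - 1, -1, -1)):
        state = None
        for k in order:
            state = out[k] if state is None else (1.0 - alpha) * out[k] + alpha * state
            out[k] = state
    return out


def da_spf(mag, fs, alpha_min=0.5, alpha_max=0.9):
    """Smooth one frame's magnitude spectrum with a frame-dependent pole."""
    return single_pole_smooth(mag, adaptive_pole(mag, fs, alpha_min, alpha_max))


if __name__ == "__main__":
    fs = 8000
    n_bins = 129  # one-sided spectrum of a 256-point FFT (31.25 Hz per bin)
    # Synthetic voiced-like frame: flat floor plus a harmonic every 8 bins (250 Hz pitch).
    frame = [1.0 + (5.0 if k % 8 == 0 else 0.0) for k in range(n_bins)]
    smoothed = da_spf(frame, fs)
    print("pole:", round(adaptive_pole(frame, fs), 3))
    print("harmonic ripple before:", max(frame) - min(frame))
    print("harmonic ripple after:", round(max(smoothed) - min(smoothed), 3))
```

Under this mapping, frames whose magnitude is concentrated below 2500 Hz (typically voiced) receive a pole near `alpha_max` and are smoothed more heavily, which matches the behaviour the abstract describes. The forward-backward pass keeps the smoothed envelope aligned with the original formant positions, whereas a single causal pass would shift it toward higher bins. The smoothed spectrum would then go through the usual mel filterbank, logarithm, and DCT stages to produce SPS-MFCCs.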

Figs. 1–8 (images not reproduced)

Data Availability

The two speech corpora used for the experimental evaluations in this paper, the WSJCAM0 Cambridge Read News corpus [19] and the PF-STAR British English Children’s Speech Corpus [1], are available online.

References

  1. A. Batliner, M. Blomberg, S. D’Arcy, D. Elenius, D. Giuliani, M. Gerosa, C. Hacker, M. Russell, M. Wong, The PF-STAR children’s speech corpus, in Proceedings of INTERSPEECH, pp. 2761–2764 (2005)

  2. V. Digalakis, D. Rtischev, L. Neumeyer, Speaker adaptation using constrained estimation of Gaussian mixtures. IEEE Trans. Speech Audio Process. 3(5), 357–366 (1995)

  3. J. Garofolo, L. Lamel, W. Fisher, J. Fiscus, D. Pallett, N. Dahlgren, V. Zue, TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1, vol. 33 (Linguistic Data Consortium, Philadelphia, 1993)

  4. M. Gerosa, D. Giuliani, S. Narayanan, A. Potamianos, A review of ASR technologies for children’s speech, in Proceedings of Workshop on Child, Computer and Interaction, pp. 7:1–7:8 (2009)

  5. S. Ghai, R. Sinha, Exploring the role of spectral smoothing in context of children’s speech recognition, in Proceedings of INTERSPEECH, pp. 1607–1610 (2009)

  6. G.E. Hinton, L. Deng, D. Yu, G. Dahl, A.R. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, B. Kingsbury, Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process. Mag. 29(6), 82–97 (2012)

  7. A. Kumar, S. Shahnawazuddin, G. Pradhan, Non-local estimation of speech signal for vowel onset point detection in varied environments, in Proceedings of INTERSPEECH, pp. 429–433 (2017)

  8. L. Lee, R. Rose, A frequency warping approach to speaker normalization. IEEE Trans. Speech Audio Process. 6(1), 49–60 (1998)

  9. S. Lee, A. Potamianos, S.S. Narayanan, Acoustics of children’s speech: developmental changes of temporal and spectral parameters. J. Acoust. Soc. Am. 105(3), 1455–1468 (1999)

  10. K. Maity, G. Pradhan, J.P. Singh, A pitch and noise robust keyword spotting system using SMAC features with prosody modification. Circuits Syst. Signal Process. 40(4), 1892–1904 (2021)

  11. J. Makhoul, F. Kubala, T. Leek, D. Liu, L. Nguyen, R. Schwartz, A. Srivastava, Speech and language technologies for audio indexing and retrieval. Proc. IEEE 88(8), 1338–1353 (2000)

  12. K.S.R. Murty, B. Yegnanarayana, Epoch extraction from speech signals. IEEE Trans. Audio Speech Lang. Process. 16, 1602–1613 (2008)

  13. S. Narayanan, A. Potamianos, Creating conversational interfaces for children. IEEE Trans. Speech Audio Process. 10(2), 65–78 (2002)

  14. B. Pattanayak, J.K. Rout, G. Pradhan, Adaptive spectral smoothening for development of robust keyword spotting system. IET Signal Process. 13(5), 544–550 (2019)

  15. A. Potamianos, S. Narayanan, Robust recognition of children’s speech. IEEE Trans. Speech Audio Process. 11(6), 603–616 (2003)

  16. D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, K. Vesely, The Kaldi speech recognition toolkit, in Proceedings of Workshop on Automatic Speech Recognition and Understanding (ASRU) (2011)

  17. S. Prasanna, D. Govind, K.S. Rao, B. Yegnanarayana, Fast prosody modification using instants of significant excitation, in Proceedings of Speech Prosody (2010)

  18. S.P. Rath, D. Povey, K. Veselý, J. Černocký, Improved feature processing for deep neural networks, in Proceedings of INTERSPEECH, pp. 109–113 (2013)

  19. T. Robinson, J. Fransen, D. Pye, J. Foote, S. Renals, WSJCAM0: a British English speech corpus for large vocabulary continuous speech recognition, in Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP) vol. 1, pp. 81–84 (1995)

  20. J. Schalkwyk, D. Beeferman, F. Beaufays, B. Byrne, C. Chelba, M. Cohen, M. Kamvar, B. Strope, Your word is my command: Google search by voice: a case study, in Advances in Speech Recognition: Mobile Environments, Call Centers and Clinics, ed. by A. Neustein (Springer, Boston, MA, 2010), pp. 61–90

  21. S. Shahnawazuddin, A. Dey, R. Sinha, Pitch-adaptive front-end features for robust children’s ASR, in Proceedings of INTERSPEECH (2016)

  22. S. Shahnawazuddin, K. Maity, G. Pradhan, Improving the performance of keyword spotting system for children’s speech through prosody modification. Digital Signal Process. 86, 11–18 (2018)

  23. S. Shahnawazuddin, R. Sinha, G. Pradhan, Pitch-normalized acoustic features for robust children’s speech recognition. IEEE Signal Process. Lett. 24(8), 1128–1132 (2017)

  24. P.G. Shivakumar, A. Potamianos, S. Lee, S. Narayanan, Improving speech recognition for children using acoustic adaptation and pronunciation modeling, in Proceedings of Workshop on Child Computer Interaction (2014)

  25. R. Sinha, S. Ghai, On the use of pitch normalization for improving children’s speech recognition, in Proceedings of INTERSPEECH, pp. 568–571 (2009)

  26. K. Sjölander, J. Beskow, WaveSurfer - an open source speech tool, in Proceedings of INTERSPEECH, pp. 464–467 (2000)

  27. A. Varga, H.J. Steeneken, Assessment for automatic speech recognition: II. NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems. Speech Commun. 12(3), 247–251 (1993)

  28. D. Vergyri, I. Shafran, A. Stolcke, R.R. Gadde, M. Akbacak, B. Roark, W. Wang, The SRI/OGI 2006 spoken term detection system, in Proceedings of Eighth Annual Conference of the International Speech Communication Association (2007)

  29. R.L. Warren, Broadcast speech recognition system for keyword monitoring. US Patent 6,332,120 (2001)

  30. S. Wegmann, A. Faria, A. Janin, K. Riedhammer, N. Morgan, The TAO of ATWV: Probing the mysteries of keyword search performance, in Proceedings of Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 192–197 (2013)

  31. I.C. Yadav, A. Kumar, S. Shahnawazuddin, G. Pradhan, Non-uniform spectral smoothing for robust children’s speech recognition, in Proceedings of INTERSPEECH, pp. 1601–1605 (2018)

  32. I.C. Yadav, G. Pradhan, Significance of pitch-based spectral normalization for children’s speech recognition. IEEE Signal Process. Lett. 26(12), 1822–1826 (2019)

  33. I.C. Yadav, S. Shahnawazuddin, G. Pradhan, Spectral smoothing by variational mode decomposition and its effect on noise and pitch robustness of ASR system, in Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 5629–5633 (2018)

  34. I.C. Yadav, S. Shahnawazuddin, G. Pradhan, Addressing noise and pitch sensitivity of speech recognition system through variational mode decomposition based spectral smoothing. Digital Signal Process. 86, 55–64 (2019)

  35. M. Zbancioc, M. Costin, Using neural networks and LPCC to improve speech recognition, in Proceedings of SCS 2003. International Symposium on Signals, Circuits and Systems, vol. 2, pp. 445–448 (2003)

  36. N. Zhao, H. Yang, Realizing speech to gesture conversion by keyword spotting, in Proceedings of International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 1–5 (2016)

Author information

Correspondence to Jayant Kumar Rout.

About this article

Cite this article

Rout, J.K., Pradhan, G. Data-Adaptive Single-Pole Filtering of Magnitude Spectra for Robust Keyword Spotting. Circuits Syst Signal Process 41, 3023–3039 (2022). https://doi.org/10.1007/s00034-021-01923-2

  • DOI: https://doi.org/10.1007/s00034-021-01923-2
