
Improving speech command recognition through decision-level fusion of deep filtered speech cues

  • Original Paper
  • Signal, Image and Video Processing

Abstract

Living beings communicate through speech, which can be analysed to identify words and sentences from the flow of spoken utterances. Background noise, however, invariably degrades speech recognition, and recognition rates in noisy conditions remain unsatisfactory, motivating further research and remedies. To better exploit noisy speech information, this work proposes speech command recognition based on a combination of median filtering and adaptive filtering, applied as two parallel, independent channels of filtered speech. The procedure involves five steps: first, the signals are enhanced by two parallel, independent speech enhancement models (median and adaptive filtering); second, 2D Mel spectrogram images are extracted from the enhanced signals; third, the Mel spectrogram images are passed to a tiny Swin Transformer pretrained on the large-scale ImageNet dataset (about 14 million images, roughly 150 GB in size); fourth, the posterior probabilities extracted from the tiny Swin Transformer are fed into the proposed 3-layered feed-forward network for classification over the 10 speech command categories; and lastly, decision-level fusion is applied to the two parallel, independent channels obtained from the 3-layered feed-forward network. Experiments are conducted on version 2 of the Google Speech Commands dataset. The proposed approach achieves a test accuracy of 99.85%, which compares favourably with other state-of-the-art methods.
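The pipeline above lends itself to a compact illustration. The Python sketch below is not the paper's implementation: it assumes a waveform median filter and a normalized-LMS adaptive line enhancer as stand-ins for the two enhancement models, librosa log-Mel spectrograms as the 2D features, random placeholder posteriors in place of the ImageNet-pretrained tiny Swin Transformer and the 3-layered feed-forward head, and a simple average of the two channels' posteriors as the decision-level fusion rule; the file name command.wav and all parameter values are illustrative only.

```python
# Minimal sketch (not the authors' code): two parallel enhancement channels,
# log-Mel spectrogram features, and decision-level fusion of class posteriors.
import numpy as np
import librosa
from scipy.signal import medfilt


def lms_line_enhancer(noisy, mu=0.5, order=16, delay=8):
    """Normalized-LMS adaptive line enhancer: a delayed copy of the input acts as
    the reference, and the filter's prediction of the correlated (speech-dominated)
    component is taken as the enhanced signal."""
    reference = np.concatenate([np.zeros(delay), noisy[:-delay]])
    w = np.zeros(order)
    enhanced = np.zeros_like(noisy)
    for n in range(order, len(noisy)):
        x = reference[n - order:n][::-1]          # most recent reference samples
        y = np.dot(w, x)                          # predictable (speech-dominated) part
        e = noisy[n] - y                          # prediction error drives adaptation
        w += mu * e * x / (np.dot(x, x) + 1e-8)   # normalized LMS weight update
        enhanced[n] = y
    return enhanced


def log_mel_image(signal, sr=16000, n_mels=64):
    """2-D log-Mel spectrogram used as the image-like classifier input."""
    mel = librosa.feature.melspectrogram(y=signal, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)


def decision_level_fusion(post_a, post_b):
    """Average the two channels' posterior probabilities and return the arg-max class."""
    fused = (np.asarray(post_a) + np.asarray(post_b)) / 2.0
    return int(np.argmax(fused)), fused


if __name__ == "__main__":
    sr = 16000
    noisy, _ = librosa.load("command.wav", sr=sr)   # hypothetical 1-s command clip

    # Two parallel, independent enhancement channels.
    chan_median = medfilt(noisy, kernel_size=5)
    chan_adaptive = lms_line_enhancer(noisy)

    # Image-like features that would feed the ImageNet-pretrained backbone.
    spec_a = log_mel_image(chan_median, sr)
    spec_b = log_mel_image(chan_adaptive, sr)

    # Placeholder posteriors stand in for the Swin Transformer + feed-forward head.
    post_a = np.random.dirichlet(np.ones(10))
    post_b = np.random.dirichlet(np.ones(10))
    label, fused = decision_level_fusion(post_a, post_b)
    print("fused prediction:", label)
```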




Data availability

The authors confirm that the data supporting the findings of this study are available within the article.


Funding

Not applicable.

Author information

Contributions

SM contributed to the implementation of the proposed approach, wrote the original draft of the manuscript, and prepared the visualizations. VR contributed to conceptualization, methodology, review and editing of the manuscript, and supervision. RA contributed to conceptualization, methodology, review and editing of the manuscript, and supervision.

Corresponding author

Correspondence to Sunakshi Mehra.

Ethics declarations

Conflict of interest

The authors declare that they have no identifiable financial or personal affiliations that could have influenced the conclusions presented in this study.

Ethical approval

The article does not include any studies or investigations that involve human or animal participants.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Mehra, S., Ranga, V. & Agarwal, R. Improving speech command recognition through decision-level fusion of deep filtered speech cues. SIViP 18, 1365–1373 (2024). https://doi.org/10.1007/s11760-023-02845-z

