Abstract
Humans communicate through speech, and spoken utterances can be analysed to recognize the words and sentences they contain. Background noise, however, always affects the speech recognition process, and recognition rates in noisy conditions remain unsatisfactory, motivating further research into remedies. To improve recognition of noisy speech, this work proposes speech command recognition based on a combination of median filtering and adaptive filtering, processing two parallel channels of filtered speech independently. The procedure involves five steps: first, the signal is enhanced by two parallel, independent speech enhancement models (median filtering and adaptive filtering); second, a 2D Mel spectrogram image is extracted from each enhanced signal; third, the 2D Mel spectrogram images are passed to a tiny Swin Transformer, pre-trained on the large-scale ImageNet dataset (about 14 million images, roughly 150 GB), for feature extraction; fourth, the posterior probabilities produced by the tiny Swin Transformer are fed to the proposed 3-layer feed-forward network for classification over the 10 speech command categories; finally, decision-level fusion combines the outputs of the two parallel, independent channels of the 3-layer feed-forward network. Experiments are conducted on version 2 of the Google Speech Commands dataset. The proposed approach achieves a test accuracy of 99.85%, comparing favourably with other state-of-the-art methods.
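The two-channel enhancement and the decision-level fusion described above can be sketched as follows. This is a minimal NumPy/SciPy illustration, not the authors' implementation: the choice of an LMS filter for the adaptive channel (which assumes a noise reference is available), the kernel size, tap count, and step size, and the simple posterior-averaging fusion rule are all assumptions made for the sketch; the Mel-spectrogram, Swin Transformer, and feed-forward stages are omitted.

```python
import numpy as np
from scipy.signal import medfilt

def median_enhance(x, kernel_size=5):
    """Channel 1: median filtering suppresses impulsive noise spikes."""
    return medfilt(x, kernel_size=kernel_size)

def lms_enhance(noisy, noise_ref, n_taps=16, mu=0.01):
    """Channel 2: adaptive noise cancellation with an LMS filter.

    `noise_ref` is a reference correlated with the noise; the error
    signal e[n] (noisy speech minus the filtered reference) serves as
    the enhanced-speech estimate.
    """
    w = np.zeros(n_taps)
    out = np.zeros_like(noisy, dtype=float)
    for n in range(n_taps, len(noisy)):
        u = noise_ref[n - n_taps:n][::-1]  # most recent samples first
        y = w @ u                          # noise estimate
        e = noisy[n] - y                   # enhanced-speech sample
        w += mu * e * u                    # LMS weight update
        out[n] = e
    return out

def fuse_decisions(p_median, p_adaptive):
    """Decision-level fusion: average the per-class posteriors of the
    two independent channels and return the arg-max command index."""
    return int(np.argmax((p_median + p_adaptive) / 2.0))
```

In the full pipeline, each enhanced signal would be converted to a 2D Mel spectrogram, passed through the tiny Swin Transformer and the 3-layer feed-forward network, and the resulting per-channel posteriors fused as in `fuse_decisions`.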
Data availability
The authors confirm that the data supporting the findings of this study are available within the article.
Funding
Not applicable.
Author information
Authors and Affiliations
Contributions
SM contributed to implementation of proposed approach, wrote the manuscript text, and was involved in writing—original draft and visualization. VR was involved in conceptualization, methodology, writing—review and editing, and supervision. RA contributed to conceptualization, methodology, writing—review and editing, and supervision.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no identifiable financial or personal interests that could have influenced the findings presented in this study.
Ethical approval
This article does not include any studies involving human or animal participants.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Mehra, S., Ranga, V. & Agarwal, R. Improving speech command recognition through decision-level fusion of deep filtered speech cues. SIViP 18, 1365–1373 (2024). https://doi.org/10.1007/s11760-023-02845-z