Abstract
Humans communicate through speech, and spoken utterances can be analysed to recognize the words and sentences they contain. Background noise, however, always affects the speech recognition process, and recognition rates in noisy conditions remain unsatisfactory, motivating further research into remedies. To improve recognition of noisy speech, this work proposes speech command recognition based on a combination of median filtering and adaptive filtering, processing two parallel channels of filtered speech independently. The procedure involves five steps: first, the signal is enhanced by two parallel, independent speech enhancement models (median filtering and adaptive filtering); second, a 2D Mel spectrogram image is extracted from each enhanced signal; third, the 2D Mel spectrogram images are passed to a tiny Swin Transformer, pre-trained on the large-scale ImageNet dataset (about 14 million images, roughly 150 GB), for feature extraction; fourth, the posterior probabilities produced by the tiny Swin Transformer are fed to the proposed 3-layer feed-forward network for classification over the 10 speech command categories; finally, decision-level fusion combines the outputs of the two parallel, independent channels of the 3-layer feed-forward network. Experiments are conducted on version 2 of the Google Speech Commands dataset. The proposed approach achieves a test accuracy of 99.85%, comparing favourably with other state-of-the-art methods.
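The two-channel enhancement and the decision-level fusion described above can be sketched as follows. This is a minimal NumPy/SciPy illustration, not the authors' implementation: the choice of an LMS filter for the adaptive channel (which assumes a noise reference is available), the kernel size, tap count, and step size, and the simple posterior-averaging fusion rule are all assumptions made for the sketch; the Mel-spectrogram, Swin Transformer, and feed-forward stages are omitted.

```python
import numpy as np
from scipy.signal import medfilt

def median_enhance(x, kernel_size=5):
    """Channel 1: median filtering suppresses impulsive noise spikes."""
    return medfilt(x, kernel_size=kernel_size)

def lms_enhance(noisy, noise_ref, n_taps=16, mu=0.01):
    """Channel 2: adaptive noise cancellation with an LMS filter.

    `noise_ref` is a reference correlated with the noise; the error
    signal e[n] (noisy speech minus the filtered reference) serves as
    the enhanced-speech estimate.
    """
    w = np.zeros(n_taps)
    out = np.zeros_like(noisy, dtype=float)
    for n in range(n_taps, len(noisy)):
        u = noise_ref[n - n_taps:n][::-1]  # most recent samples first
        y = w @ u                          # noise estimate
        e = noisy[n] - y                   # enhanced-speech sample
        w += mu * e * u                    # LMS weight update
        out[n] = e
    return out

def fuse_decisions(p_median, p_adaptive):
    """Decision-level fusion: average the per-class posteriors of the
    two independent channels and return the arg-max command index."""
    return int(np.argmax((p_median + p_adaptive) / 2.0))
```

In the full pipeline, each enhanced signal would be converted to a 2D Mel spectrogram, passed through the tiny Swin Transformer and the 3-layer feed-forward network, and the resulting per-channel posteriors fused as in `fuse_decisions`.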
Data availability
The authors confirm that the data supporting the findings of this study are available within the article.
Funding
Not applicable.
Author information
Authors and Affiliations
Contributions
SM contributed to implementation of proposed approach, wrote the manuscript text, and was involved in writing—original draft and visualization. VR was involved in conceptualization, methodology, writing—review and editing, and supervision. RA contributed to conceptualization, methodology, writing—review and editing, and supervision.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no identifiable financial or personal interests that could have influenced the findings presented in this study.
Ethical approval
This article does not include any studies involving human or animal participants.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Mehra, S., Ranga, V. & Agarwal, R. Improving speech command recognition through decision-level fusion of deep filtered speech cues. SIViP 18, 1365–1373 (2024). https://doi.org/10.1007/s11760-023-02845-z