Abstract
This study addresses the problem of effectively combining multimodal data, imperfect text transcripts and raw audio, in a deep framework for automatic speech recognition. We propose a late-fusion approach in which a self-attention-based deep bidirectional long short-term memory network (SA-deep BiLSTM) processes the audio and text modalities independently. Each feature type is used to train an SA-deep BiLSTM model that comprises five BiLSTM layers with a self-attention module between the third and fourth layers. Linguistic features, namely the word stems extracted from the text transcripts, and acoustic features, namely Mel-frequency cepstral coefficients (MFCCs) and the Mel-spectrogram, are considered. GloVe word embeddings are used to vectorize the linguistic features. By fusing the posterior class probabilities of the SA-deep BiLSTM models trained on the individual modalities, we achieve an accuracy of 98.80% on the 10-word-category subset of the Google Speech Commands dataset. Extensive experiments on this dataset, together with an ablation analysis, show that the proposed method outperforms the state of the art.
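The decision-level fusion step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `late_fuse`, the 3-class toy logits, and the equal 0.5/0.5 weighting of the two branches are all assumptions made for the example.

```python
import math

def softmax(logits):
    """Convert a branch's raw scores into posterior class probabilities."""
    m = max(logits)                     # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def late_fuse(audio_logits, text_logits, w_audio=0.5):
    """Decision-level fusion: weighted average of the two branches' posteriors."""
    p_audio = softmax(audio_logits)
    p_text = softmax(text_logits)
    return [w_audio * a + (1.0 - w_audio) * t
            for a, t in zip(p_audio, p_text)]

# Toy 3-class example: the audio branch is undecided between classes 0 and 1,
# while the text branch clearly favours class 1; fusion resolves the ambiguity.
audio = [2.0, 2.0, 0.1]
text = [0.5, 3.0, 0.2]
fused = late_fuse(audio, text)
print(fused.index(max(fused)))  # → 1
```

Because each branch's output is already a probability distribution, their weighted average is also a valid distribution, so the fused scores can be argmax-decoded directly into a word-class prediction.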
Data availability
The data will be made available by the authors on request.
Ethics declarations
Conflict of interest
The authors declare that there is no conflict of interest.
Cite this article
Mehra, S., Susan, S. Deep fusion framework for speech command recognition using acoustic and linguistic features. Multimed Tools Appl 82, 38667–38691 (2023). https://doi.org/10.1007/s11042-023-15118-1