End-to-end Multi-modal Low-resourced Speech Keywords Recognition Using Sequential Conv2D Nets

Published: 15 January 2024

Abstract

Advanced neural networks are widely used to automatically recognize multi-modal conversational speech with significant improvements in accuracy. In particular, Convolutional Neural Networks (CNNs) have recently achieved state-of-the-art performance in Automatic Speech Recognition (ASR), most notably for English; the Hindi language, however, has not been as well explored and examined in ASR systems. The work in this article presents a three-layered two-dimensional Sequential Convolutional neural architecture. The Sequential Conv2D network is an end-to-end system that can simultaneously exploit the spectral and temporal structure of the speech signal. The network has been trained and tested on different cepstral features, namely Mel-Frequency Cepstral Coefficients (MFCC), Gammatone Frequency Cepstral Coefficients (GFCC), Bark-Frequency Cepstral Coefficients (BFCC), and spectrogram features of the speech signal. The experiments were performed on two low-resourced speech command datasets: a Hindi dataset of 27,145 speech keywords developed by TIFR, and 23,664 one-second utterances from the Google TensorFlow and AIY English Speech Commands dataset. The experimental results show that, for English speech, the convolutional layers trained on spectrograms achieve a significant accuracy of 91.60%, outperforming the other cepstral feature sets. For Hindi audio words, the model achieved an accuracy of 69.65%, with Bark-frequency cepstral coefficient features outperforming spectrogram features.
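The article itself does not include an implementation; the sketch below is a minimal, hypothetical illustration of the pipeline the abstract describes: extracting a fixed-size cepstral feature matrix from a one-second utterance and classifying it with a three-layer Sequential Conv2D network. It assumes TensorFlow/Keras and librosa; the helper names (mfcc_features, build_model), filter counts, kernel sizes, and dropout rate are illustrative assumptions, not the authors' reported configuration, and GFCC/BFCC extraction would require gammatone and Bark filter-bank tooling not shown here.

```python
# Hypothetical sketch (not the authors' code): MFCC extraction plus a
# three-layer Sequential Conv2D keyword classifier, assuming TensorFlow/Keras
# and librosa. All hyperparameters below are illustrative assumptions.
import numpy as np
import librosa
from tensorflow.keras import layers, models

def mfcc_features(path, sr=16000, n_mfcc=13, max_frames=32):
    """Load a ~1-second utterance and return a fixed-size MFCC matrix."""
    signal, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    # Pad or truncate along the time axis so every example has the same shape.
    mfcc = librosa.util.fix_length(mfcc, size=max_frames, axis=1)
    return mfcc[..., np.newaxis]  # shape: (n_mfcc, max_frames, 1)

def build_model(input_shape, n_classes):
    """Three stacked Conv2D blocks feeding a dense softmax classifier."""
    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (3, 3), padding="same", activation="relu"),
        layers.GlobalAveragePooling2D(),
        layers.Dropout(0.3),
        layers.Dense(n_classes, activation="softmax"),
    ])

# Example: 13 MFCCs x 32 frames, 30 keyword classes (both assumed values).
model = build_model(input_shape=(13, 32, 1), n_classes=30)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

Swapping MFCCs for log-mel spectrograms (e.g., librosa.feature.melspectrogram) only changes the input shape; the convolutional stack stays the same, which is what allows one architecture to be compared across the different feature types the abstract lists.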



    Published In

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 23, Issue 1
January 2024, 385 pages
EISSN: 2375-4702
DOI: 10.1145/3613498

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 15 January 2024
    Online AM: 24 July 2023
    Accepted: 21 May 2023
    Revised: 22 April 2023
    Received: 21 September 2022
    Published in TALLIP Volume 23, Issue 1


    Author Tags

    1. Neural networks
    2. speech recognition
    3. sequential
    4. spectrogram
    5. convolution layers

    Qualifiers

    • Research-article

    Cited By

• (2024) Overcoming the barrier of emotion in interlingual transcription: A case study of Malayalam to English transcription using convolutional neural networks. In 2024 OPJU International Technology Conference (OTCON) on Smart Computing for Innovation and Advancement in Industry 4.0, 1–6. DOI: 10.1109/OTCON60325.2024.10687437. Online publication date: 5-Jun-2024.
• (2024) Traffic Sign Recognition using Shift In-variant 2-D ConvNet. In 2024 International Conference on Inventive Computation Technologies (ICICT), 135–141. DOI: 10.1109/ICICT60155.2024.10544943. Online publication date: 24-Apr-2024.
• (2024) Leveraging Contrastive Language–Image Pre-Training and Bidirectional Cross-attention for Multimodal Keyword Spotting. Engineering Applications of Artificial Intelligence 138, 109403. DOI: 10.1016/j.engappai.2024.109403. Online publication date: Dec-2024.
• (2024) Speech Recognition Using Adaptation of Whisper Models. In Artificial Intelligence and Speech Technology, 323–334. DOI: 10.1007/978-3-031-75164-6_24. Online publication date: 24-Nov-2024.
• (2024) Hindi Speech Recognition Using Deep Learning: A Review. In Artificial Intelligence and Speech Technology, 227–237. DOI: 10.1007/978-3-031-75164-6_17. Online publication date: 24-Nov-2024.
• (2024) Kannada Continuous Speech Recognition Using Deep Learning. In Advanced Network Technologies and Intelligent Computing, 258–269. DOI: 10.1007/978-3-031-64067-4_17. Online publication date: 8-Aug-2024.
• (2023) Analysis and Classification of Dysarthric Speech. In 2023 26th Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA), 1–6. DOI: 10.1109/O-COCOSDA60357.2023.10482956. Online publication date: 4-Dec-2023.
• (2023) Early detection of red palm weevil infestations using deep learning classification of acoustic signals. Computers and Electronics in Agriculture 212:C. DOI: 10.1016/j.compag.2023.108154. Online publication date: 1-Sep-2023.
