End-to-end Multi-modal Low-resourced Speech Keywords Recognition Using Sequential Conv2D Nets

Published: 15 January 2024

Abstract

Advanced neural networks are widely used to automatically recognize multi-modal conversational speech with significant improvements in accuracy. In particular, Convolutional Neural Networks (CNNs) have recently achieved state-of-the-art performance in Automatic Speech Recognition (ASR), most notably for English; the Hindi language, however, has not been as well explored and examined in ASR systems. The work in this article presents a three-layered two-dimensional Sequential Convolutional neural architecture. The Sequential Conv2D network is an end-to-end system that can simultaneously exploit the spectral and temporal structure of the speech signal. The network has been trained and tested on different cepstral features, namely Mel-Frequency Cepstral Coefficients (MFCC), Gammatone Frequency Cepstral Coefficients (GFCC), Bark-Frequency Cepstral Coefficients (BFCC), and spectrogram features of the speech signal. The experiments were performed on two low-resourced speech command datasets: a Hindi dataset of 27,145 speech keywords developed by TIFR, and 23,664 one-second utterances from the Google TensorFlow and AIY English Speech Commands dataset. The experimental results show that, for English speech, the convolutional layers trained on spectrograms achieve a significant accuracy of 91.60%, outperforming the other cepstral feature sets. For Hindi audio words, the model achieved an accuracy of 69.65%, with Bark-frequency cepstral coefficient features outperforming spectrogram features.
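The article itself does not include an implementation; the sketch below is a minimal, hypothetical illustration of the pipeline the abstract describes: extracting a fixed-size cepstral feature matrix from a one-second utterance and classifying it with a three-layer Sequential Conv2D network. It assumes TensorFlow/Keras and librosa; the helper names (mfcc_features, build_model), filter counts, kernel sizes, and dropout rate are illustrative assumptions, not the authors' reported configuration, and GFCC/BFCC extraction would require gammatone and Bark filter-bank tooling not shown here.

```python
# Hypothetical sketch (not the authors' code): MFCC extraction plus a
# three-layer Sequential Conv2D keyword classifier, assuming TensorFlow/Keras
# and librosa. All hyperparameters below are illustrative assumptions.
import numpy as np
import librosa
from tensorflow.keras import layers, models

def mfcc_features(path, sr=16000, n_mfcc=13, max_frames=32):
    """Load a ~1-second utterance and return a fixed-size MFCC matrix."""
    signal, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    # Pad or truncate along the time axis so every example has the same shape.
    mfcc = librosa.util.fix_length(mfcc, size=max_frames, axis=1)
    return mfcc[..., np.newaxis]  # shape: (n_mfcc, max_frames, 1)

def build_model(input_shape, n_classes):
    """Three stacked Conv2D blocks feeding a dense softmax classifier."""
    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (3, 3), padding="same", activation="relu"),
        layers.GlobalAveragePooling2D(),
        layers.Dropout(0.3),
        layers.Dense(n_classes, activation="softmax"),
    ])

# Example: 13 MFCCs x 32 frames, 30 keyword classes (both assumed values).
model = build_model(input_shape=(13, 32, 1), n_classes=30)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

Swapping MFCCs for log-mel spectrograms (e.g., librosa.feature.melspectrogram) only changes the input shape; the convolutional stack stays the same, which is what allows one architecture to be compared across the different feature types the abstract lists.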



    Published In

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 23, Issue 1
January 2024, 385 pages
EISSN: 2375-4702
DOI: 10.1145/3613498

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 15 January 2024
    Online AM: 24 July 2023
    Accepted: 21 May 2023
    Revised: 22 April 2023
    Received: 21 September 2022
    Published in TALLIP Volume 23, Issue 1


    Author Tags

    1. Neural networks
    2. speech recognition
    3. sequential
    4. spectrogram
    5. convolution layers

    Qualifiers

    • Research-article

    Cited By

• (2024) Overcoming the barrier of emotion in interlingual transcription: A case study of Malayalam to English transcription using convolutional neural networks. In 2024 OPJU International Technology Conference (OTCON) on Smart Computing for Innovation and Advancement in Industry 4.0, 1–6. DOI: 10.1109/OTCON60325.2024.10687437. Online publication date: 5-Jun-2024.
• (2024) Traffic Sign Recognition using Shift In-variant 2-D ConvNet. In 2024 International Conference on Inventive Computation Technologies (ICICT), 135–141. DOI: 10.1109/ICICT60155.2024.10544943. Online publication date: 24-Apr-2024.
• (2024) Leveraging Contrastive Language–Image Pre-Training and Bidirectional Cross-attention for Multimodal Keyword Spotting. Engineering Applications of Artificial Intelligence 138, 109403. DOI: 10.1016/j.engappai.2024.109403. Online publication date: Dec-2024.
• (2024) Speech Recognition Using Adaptation of Whisper Models. In Artificial Intelligence and Speech Technology, 323–334. DOI: 10.1007/978-3-031-75164-6_24. Online publication date: 24-Nov-2024.
• (2024) Hindi Speech Recognition Using Deep Learning: A Review. In Artificial Intelligence and Speech Technology, 227–237. DOI: 10.1007/978-3-031-75164-6_17. Online publication date: 24-Nov-2024.
• (2024) Kannada Continuous Speech Recognition Using Deep Learning. In Advanced Network Technologies and Intelligent Computing, 258–269. DOI: 10.1007/978-3-031-64067-4_17. Online publication date: 8-Aug-2024.
• (2023) Analysis and Classification of Dysarthric Speech. In 2023 26th Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA), 1–6. DOI: 10.1109/O-COCOSDA60357.2023.10482956. Online publication date: 4-Dec-2023.
• (2023) Early detection of red palm weevil infestations using deep learning classification of acoustic signals. Computers and Electronics in Agriculture 212:C. DOI: 10.1016/j.compag.2023.108154. Online publication date: 1-Sep-2023.
