Abstract
ASR is an effective technology that converts human speech into text or computer actions. It involves speech feature extraction together with acoustic and language modeling; feature extraction is a crucial aspect of speech recognition, serving as both information compression and signal deconvolution. ASR schemes are widely employed in smart homes, smart appliances, and biometric systems. Yet traditional approaches perform poorly in noisy environments, and regional variations and accents further degrade ASR performance during the conversion of speech signals. This paper introduces a hybrid RMDL-CNN method to address these challenges. First, an unclear speech signal is taken from the dataset. Then, the signal is pre-processed with a Gaussian filter, and speech enhancement is performed using nonlinear spectral subtraction. Next, speech words are segmented from the enhanced output with an attentional encoder-decoder approach, and finally the speech is recognized by the proposed RMDL-CNN, which combines RMDL with a CNN. The RMDL-CNN is further assessed for efficiency over several k-fold values and training-data percentages. The proposed approach achieved an accuracy of 0.909, a PPV of 0.947, and an NPV of 0.917 on dataset 1, and an accuracy of 0.909, a PPV of 0.926, and an NPV of 0.888 on dataset 2.
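The enhancement front end described in the abstract (Gaussian filtering followed by nonlinear spectral subtraction) can be sketched in a few lines of Python. The snippet below is a minimal illustration of those two steps only, not the authors' implementation; the filter width sigma, the STFT frame size, the over-subtraction factor alpha, the spectral floor beta, and the assumption that the opening 0.25 s of the signal is noise-only are all illustrative choices.

```python
# Sketch of the pre-processing and enhancement front end: Gaussian
# smoothing, then spectral subtraction with over-subtraction and a
# spectral floor (a simple nonlinear variant). All parameters are
# illustrative assumptions, not the paper's settings.
import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.signal import stft, istft

def preprocess(signal, sigma=1.0):
    """Suppress impulsive noise by convolving with a Gaussian kernel."""
    return gaussian_filter1d(signal.astype(float), sigma=sigma)

def spectral_subtraction(signal, fs, noise_seconds=0.25,
                         alpha=2.0, beta=0.01):
    """Enhance speech by subtracting an estimated noise magnitude.

    alpha > 1 over-subtracts the noise estimate; beta keeps a small
    spectral floor so the magnitude never collapses to zero.
    """
    nperseg = 512                      # hop = nperseg // 2 = 256 samples
    f, t, spec = stft(signal, fs=fs, nperseg=nperseg)
    mag, phase = np.abs(spec), np.angle(spec)
    # Assume the first noise_seconds are noise-only (an assumption of
    # this sketch; the paper may estimate noise differently).
    noise_frames = max(1, int(noise_seconds * fs / (nperseg // 2)))
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)
    clean_mag = np.maximum(mag - alpha * noise_mag, beta * mag)
    _, enhanced = istft(clean_mag * np.exp(1j * phase),
                        fs=fs, nperseg=nperseg)
    return enhanced

# Usage with a stand-in signal:
# fs, audio = 16000, np.random.randn(16000)
# enhanced = spectral_subtraction(preprocess(audio), fs)
```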
Data availability
The data underlying this article are available in the Saarbrücker Stimmdatenbank, https://www.stimmdatenbank.coli.uni-saarland.de/ (accessed September 2023), and in the noisy speech database, https://datashare.ed.ac.uk/handle/10283/2791 (accessed October 2023).
Abbreviations
- ASR: Automatic speech recognition
- RMDL-CNN: Random multimodel deep learning-convolutional neural network
- RMDL: Random multimodel deep learning
- CNN: Convolutional neural network
- PPV: Positive predictive value
- NPV: Negative predictive value
- HCI: Human–computer interaction
- NIDCD: National Institute on Deafness and Other Communication Disorders
- VoIP: Voice over internet protocol
- DL: Deep learning
- RNN: Recurrent neural network
- ML: Machine learning
- DNN: Deep neural network
- ELM: Extreme learning machine
- ANN: Artificial neural network
- SDRN: Stochastic deep resilient network
- HHO: Harris Hawks optimization
- VMD + CNN: Variational mode decomposition and CNN
- DSL-Net: Domain-specific language speech network
- CD-Net: Confidence decision network
- LSTM-RNN: Long short-term memory recurrent neural network
- CNN-BLSTM: Convolutional neural network-bidirectional long short-term memory
- WER: Word error rate
- UL: Unwritten language
- WRL: Well-resourced language
- NMT: Neural machine translation
- CRMDL: Convolutional multimodel deep learning
- EGG: Electroglottographic
- SER: Speech emotion recognition
- DFT: Discrete Fourier transform
- TTS: Text-to-speech
Acknowledgements
I would like to express my sincere appreciation to the co-authors of this manuscript for their valuable and constructive suggestions during the planning and development of this research.
Funding
This research did not receive any specific funding.
Author information
Contributions
All authors have made substantial contributions to the conception and design, revising the manuscript, and the final approval of the version to be published. Also, all authors agreed to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Ethical approval
Not applicable.
Informed consent
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Bhargava, R., Arivazhagan, N. & Babu, K.S. Hybrid RMDL-CNN for speech recognition from unclear speech signal. Int J Speech Technol 28, 195–217 (2025). https://doi.org/10.1007/s10772-024-10167-9