Abstract
ASR is an effective technology that converts human speech into text or computer actions. It involves speech feature extraction together with acoustic and language modeling; feature extraction is a crucial aspect of speech recognition, serving as both information compression and signal deconvolution. ASR schemes are widely employed in smart homes, smart appliances, and biometric systems. Yet traditional approaches perform poorly in noisy environments, and regional variations and accents further degrade ASR performance during the conversion of speech signals. This paper introduces a hybrid RMDL-CNN method to address these challenges. First, an unclear speech signal is taken from the dataset. Then, the signal is pre-processed with a Gaussian filter, and speech enhancement is performed using nonlinear spectral subtraction. Next, speech words are segmented from the enhanced output with an attentional encoder-decoder approach, and finally the speech is recognized by the proposed RMDL-CNN, which combines RMDL with a CNN. The RMDL-CNN is further assessed for efficiency over several k-fold values and training-data percentages. The proposed approach achieved an accuracy of 0.909, a PPV of 0.947, and an NPV of 0.917 on dataset 1, and an accuracy of 0.909, a PPV of 0.926, and an NPV of 0.888 on dataset 2.
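The enhancement front end described in the abstract (Gaussian filtering followed by nonlinear spectral subtraction) can be sketched in a few lines of Python. The snippet below is a minimal illustration of those two steps only, not the authors' implementation; the filter width sigma, the STFT frame size, the over-subtraction factor alpha, the spectral floor beta, and the assumption that the opening 0.25 s of the signal is noise-only are all illustrative choices.

```python
# Sketch of the pre-processing and enhancement front end: Gaussian
# smoothing, then spectral subtraction with over-subtraction and a
# spectral floor (a simple nonlinear variant). All parameters are
# illustrative assumptions, not the paper's settings.
import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.signal import stft, istft

def preprocess(signal, sigma=1.0):
    """Suppress impulsive noise by convolving with a Gaussian kernel."""
    return gaussian_filter1d(signal.astype(float), sigma=sigma)

def spectral_subtraction(signal, fs, noise_seconds=0.25,
                         alpha=2.0, beta=0.01):
    """Enhance speech by subtracting an estimated noise magnitude.

    alpha > 1 over-subtracts the noise estimate; beta keeps a small
    spectral floor so the magnitude never collapses to zero.
    """
    nperseg = 512                      # hop = nperseg // 2 = 256 samples
    f, t, spec = stft(signal, fs=fs, nperseg=nperseg)
    mag, phase = np.abs(spec), np.angle(spec)
    # Assume the first noise_seconds are noise-only (an assumption of
    # this sketch; the paper may estimate noise differently).
    noise_frames = max(1, int(noise_seconds * fs / (nperseg // 2)))
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)
    clean_mag = np.maximum(mag - alpha * noise_mag, beta * mag)
    _, enhanced = istft(clean_mag * np.exp(1j * phase),
                        fs=fs, nperseg=nperseg)
    return enhanced

# Usage with a stand-in signal:
# fs, audio = 16000, np.random.randn(16000)
# enhanced = spectral_subtraction(preprocess(audio), fs)
```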
Data availability
The data underlying this article are available in the Saarbrücker Stimmdatenbank, https://www.stimmdatenbank.coli.uni-saarland.de/ (accessed September 2023), and in the noisy speech database, https://datashare.ed.ac.uk/handle/10283/2791 (accessed October 2023).
Abbreviations
- ASR: Automatic speech recognition
- RMDL-CNN: Random multimodel deep learning-convolutional neural network
- RMDL: Random multimodel deep learning
- CNN: Convolutional neural network
- PPV: Positive predictive value
- NPV: Negative predictive value
- HCI: Human–computer interaction
- NIDCD: National Institute on Deafness and Other Communication Disorders
- VoIP: Voice over internet protocol
- DL: Deep learning
- RNN: Recurrent neural network
- ML: Machine learning
- DNN: Deep neural network
- ELM: Extreme learning machine
- ANN: Artificial neural network
- SDRN: Stochastic deep resilient network
- HHO: Harris Hawks optimization
- VMD + CNN: Variational mode decomposition and CNN
- DSL-Net: Domain-specific language speech network
- CD-Net: Confidence decision network
- LSTM-RNN: Long short-term memory recurrent neural network
- CNN-BLSTM: Convolutional neural network-bidirectional long short-term memory
- WER: Word error rate
- UL: Unwritten language
- WRL: Well-resourced language
- NMT: Neural machine translation
- CRMDL: Convolutional multimodel deep learning
- EGG: Electroglottographic
- SER: Speech emotion recognition
- DFT: Discrete Fourier transform
- TTS: Text-to-speech
Acknowledgements
I would like to express my sincere appreciation to the co-authors of this manuscript for their valuable and constructive suggestions during the planning and development of this research.
Funding
This research did not receive any specific funding.
Author information
Contributions
All authors have made substantial contributions to the conception and design, revising the manuscript, and the final approval of the version to be published. Also, all authors agreed to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Ethical approval
Not applicable.
Informed consent
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Bhargava, R., Arivazhagan, N. & Babu, K.S. Hybrid RMDL-CNN for speech recognition from unclear speech signal. Int J Speech Technol 28, 195–217 (2025). https://doi.org/10.1007/s10772-024-10167-9