Hybrid RMDL-CNN for speech recognition from unclear speech signal

Published in: International Journal of Speech Technology

Abstract

Automatic speech recognition (ASR) converts human speech into text or computer actions. It involves extracting and modelling acoustic features together with an acoustic model and a language model; feature extraction is a crucial stage of speech recognition, serving as both information compression and signal deconvolution. ASR systems are widely employed in smart homes, smart appliances, and biometric systems. However, traditional approaches perform poorly in noisy environments, and regional variations and accents further degrade recognition when speech signals are converted. This paper introduces a hybrid RMDL-CNN method to address these challenges. First, the unclear speech input is drawn from the dataset. Signal pre-processing is then performed with a Gaussian filter, and voice enhancement is accomplished by nonlinear spectral subtraction. Next, speech words are segmented from the enhanced output using an attentional encoder-decoder approach, and finally the speech is recognized by the proposed RMDL-CNN, which combines RMDL and CNN. The efficiency of the RMDL-CNN is further assessed across several k-values and training-data proportions. The proposed approach achieved an accuracy of 0.909, a PPV of 0.947, and an NPV of 0.917 on dataset 1, and an accuracy of 0.909, a PPV of 0.926, and an NPV of 0.888 on dataset 2.
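The pre-processing and enhancement steps described in the abstract (Gaussian filtering of the raw signal, then spectral subtraction of a noise estimate) can be sketched in plain Python. This is a minimal illustration under assumed interfaces, not the authors' implementation; `gaussian_smooth` and `spectral_subtract` are hypothetical helper names, and real systems would operate on framed short-time spectra rather than a bare list of samples:

```python
import math

def gaussian_kernel(sigma, radius):
    # Discrete 1-D Gaussian kernel, normalised so the weights sum to 1.
    vals = [math.exp(-(i * i) / (2.0 * sigma * sigma))
            for i in range(-radius, radius + 1)]
    total = sum(vals)
    return [v / total for v in vals]

def gaussian_smooth(signal, sigma=1.0):
    # Convolve the signal with the Gaussian kernel, clamping at the edges.
    radius = max(1, int(3 * sigma))
    kernel = gaussian_kernel(sigma, radius)
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)  # replicate edge samples
            acc += w * signal[j]
        out.append(acc)
    return out

def spectral_subtract(mag, noise_mag, floor=0.01):
    # Basic spectral subtraction on magnitude spectra: remove the noise
    # estimate per bin, keeping a small spectral floor to avoid negatives.
    return [max(m - n, floor * m) for m, n in zip(mag, noise_mag)]
```

The spectral floor (here 1% of the noisy magnitude) is a common guard against the "musical noise" that plain subtraction produces when the noise estimate exceeds the observed magnitude in a bin.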




Data availability

The data underlying this article are available in the Saarbrücker Stimmdatenbank (https://www.stimmdatenbank.coli.uni-saarland.de/, accessed September 2023) and in the Noisy Speech Database (https://datashare.ed.ac.uk/handle/10283/2791, accessed October 2023).

Abbreviations

ASR:

Automatic speech recognition

RMDL-CNN:

Random multimodal deep learning-convolutional neural network

RMDL:

Random multimodal deep learning

CNN:

Convolutional neural network

PPV:

Positive predictive value

NPV:

Negative predictive value

HCI:

Human–computer interaction

NIDCD:

National Institute on Deafness and Other Communication Disorders

VoIP:

Voice over internet protocol

DL:

Deep learning

RNN:

Recurrent neural networks

ML:

Machine learning

DNN:

Deep neural network

ELM:

Extreme learning machine

ANN:

Artificial neural network

SDRN:

Stochastic deep resilient network

HHO:

Harris Hawks optimization

VMD + CNN:

Variational mode decomposition and CNN

DSL-Net:

Domain-specific language speech network

CD-Net:

Confidence decision network

LSTM-RNN:

Long short-term memory recurrent neural network

CNN-BLSTM:

Convolution neural network-bidirectional long short-term memory

WER:

Word error rate

UL:

Unwritten language

WRL:

Well-resourced language

NMT:

Neural machine translation

CRMDL:

Convolutional multimodel deep learning

EGG:

Electroglottographic

SER:

Speech emotion recognition

DFT:

Discrete Fourier transform

TTS:

Text-to-speech
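The PPV, NPV, and accuracy values reported in the abstract follow from the standard confusion-matrix definitions. The sketch below uses illustrative counts, not the authors' evaluation code:

```python
def ppv_npv_accuracy(tp, fp, tn, fn):
    """Positive/negative predictive value and accuracy from confusion counts."""
    ppv = tp / (tp + fp)                    # fraction of predicted positives that are true
    npv = tn / (tn + fn)                    # fraction of predicted negatives that are true
    acc = (tp + tn) / (tp + fp + tn + fn)   # overall fraction classified correctly
    return ppv, npv, acc
```

For example, 90 true positives, 10 false positives, 80 true negatives, and 20 false negatives give a PPV of 0.90, an NPV of 0.80, and an accuracy of 0.85.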


Acknowledgements

I would like to express my sincere appreciation to the co-authors of this manuscript for their valuable and constructive suggestions during the planning and development of this research work.

Funding

This research did not receive any specific funding.

Author information

Authors and Affiliations

Authors

Contributions

All authors have made substantial contributions to the conception and design, revising the manuscript, and the final approval of the version to be published. Also, all authors agreed to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

Corresponding author

Correspondence to Raja Bhargava.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Ethical approval

Not Applicable.

Informed consent

Not Applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article


Cite this article

Bhargava, R., Arivazhagan, N. & Babu, K.S. Hybrid RMDL-CNN for speech recognition from unclear speech signal. Int J Speech Technol 28, 195–217 (2025). https://doi.org/10.1007/s10772-024-10167-9
