
Noise-robust speech recognition in mobile network based on convolution neural networks

Published in: International Journal of Speech Technology

Abstract

The performance of continuous automatic speech recognition systems (CASRS) over mobile network communications degrades rapidly in the presence of speech signal variability such as environmental noise, the communication channel, and the speech codec. Several techniques have been proposed to improve recognition accuracy. An ASR system consists of two main processing stages: feature extraction (front-end) and classification (back-end). We are motivated to develop speech separation (feature enhancement) algorithms that improve both the intelligibility of noisy speech and the accuracy of ASR. We use non-negative matrix factorization and the ideal binary mask, estimated by a deep neural network (DNN), to exploit the spectro-temporal structure of magnitude spectrograms for supervised speech separation. The recognizer is a convolutional neural network whose input is log Mel cepstrum features. The system was trained on 440 sentences from 20 speakers, encoded with the AMR-NB codec and contaminated with noise at signal-to-noise ratios of 0 dB, 5 dB, and 10 dB.
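For illustration, the sketch below shows how an ideal binary mask (IBM) is defined and applied to a noisy mixture. It is a minimal sketch, assuming synthetic 8 kHz signals, a 256-sample STFT window, and a 0 dB local-SNR threshold (all assumptions for illustration, not parameters reported here); in the system described above, the mask is estimated by a DNN from the noisy input rather than computed from clean references.

```python
# Minimal sketch of the ideal binary mask (IBM) as a separation target.
# Assumptions (not from the paper): synthetic 8 kHz signals, a 256-sample
# STFT window, and a 0 dB local-SNR threshold. A DNN would estimate this
# mask from the noisy input at test time; here it is computed from the
# clean/noise references purely to illustrate the definition.
import numpy as np
from scipy.signal import stft, istft

fs = 8000                                # AMR-NB operates on 8 kHz speech
rng = np.random.default_rng(0)

t = np.arange(fs) / fs
clean = np.sin(2 * np.pi * 440 * t)      # toy stand-in for clean speech
noise = rng.normal(scale=0.5, size=fs)   # toy additive noise
mix = clean + noise                      # noisy mixture (about 3 dB SNR)

# Magnitude spectrograms of the clean, noise, and mixed signals.
_, _, S = stft(clean, fs, nperseg=256)
_, _, N = stft(noise, fs, nperseg=256)
_, _, Y = stft(mix, fs, nperseg=256)

# IBM: 1 in time-frequency units where the local SNR exceeds 0 dB, else 0.
local_snr_db = 20 * np.log10((np.abs(S) + 1e-10) / (np.abs(N) + 1e-10))
ibm = (local_snr_db > 0.0).astype(float)

# Masking the mixture spectrogram and resynthesizing yields the enhanced
# waveform that would feed the feature-extraction front-end.
_, enhanced = istft(ibm * Y, fs, nperseg=256)
```

At recognition time such a DNN predicts the mask from noisy features alone; the clean and noise references are needed only to construct training targets, and the masked spectrogram is resynthesized before the log Mel cepstrum features are extracted for the CNN back-end.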






Author information

Correspondence to Lallouani Bouchakour.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Bouchakour, L., Debyeche, M. Noise-robust speech recognition in mobile network based on convolution neural networks. Int J Speech Technol 25, 269–277 (2022). https://doi.org/10.1007/s10772-021-09950-9

