Abstract
The performance of continuous automatic speech recognition systems (CASRS) over communication networks degrades rapidly in the presence of speech-signal variability such as environmental noise, the communication channel, and the speech codec. Several techniques have been proposed to improve recognition accuracy. An ASR system consists of two main processing steps: feature extraction (front-end) and classification (back-end). We are motivated to develop speech separation algorithms (feature enhancement) that improve both the intelligibility of noisy speech and the accuracy of ASR. We use non-negative matrix factorization and an ideal binary mask, estimated by a deep neural network (DNN), to exploit the spectro-temporal structure of magnitude spectrograms for supervised speech separation. The ASR back-end is a convolutional neural network whose input is log-Mel cepstral features. The system was trained on 440 sentences from 20 speakers, encoded with the AMR-NB codec and contaminated with noise at several signal-to-noise ratios (0 dB, 5 dB and 10 dB).
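The abstract does not give implementation details, but the two ingredients it names — the ideal binary mask over magnitude spectrograms and mixing speech with noise at a target SNR — can be sketched as follows. This is a minimal illustration, assuming magnitude spectrograms are available as NumPy arrays; the function names and the 0 dB local-SNR criterion are illustrative, not taken from the paper.

```python
import numpy as np

def ideal_binary_mask(speech_mag, noise_mag, lc_db=0.0):
    """Ideal binary mask: 1 where the local SNR of a time-frequency
    unit exceeds the local criterion lc_db, else 0 (Wang, 2005)."""
    eps = 1e-12  # avoid log(0) / division by zero
    local_snr_db = 20.0 * np.log10((speech_mag + eps) / (noise_mag + eps))
    return (local_snr_db > lc_db).astype(np.float32)

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise waveform so the mixture speech + noise has the
    requested global SNR in dB (e.g. 0, 5 or 10 dB as in the abstract)."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise
```

In a supervised setting such as the one described, the DNN would be trained to predict `ideal_binary_mask(...)` from features of the noisy mixture, and the estimated mask would then be applied elementwise to the mixture spectrogram before resynthesis or feature extraction.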
Cite this article
Bouchakour, L., Debyeche, M. Noise-robust speech recognition in mobile network based on convolution neural networks. Int J Speech Technol 25, 269–277 (2022). https://doi.org/10.1007/s10772-021-09950-9