
Performance enhancement of text-independent speaker recognition in noisy and reverberation conditions using Radon transform with deep learning

Published in: International Journal of Speech Technology

Abstract

Automatic Speaker Recognition (ASR) under mismatched conditions is a challenging task, since it requires robust feature extraction and classification techniques. The Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) is an efficient network that can learn to recognize speakers text-independently when the training and testing recording conditions are similar; unfortunately, its performance degrades when those conditions differ. In this paper, features are extracted from the Radon projection of speech-signal spectrograms, since the Radon Transform (RT) is less sensitive to noise and reverberation. The Radon projection is applied to the spectrograms of the speech signals, and the 2-D Discrete Cosine Transform (DCT) is then computed. This technique improves the recognition accuracy of the text-independent system while reducing its sensitivity to noise and reverberation effects. The performance of the ASR system with the proposed features is compared with that of systems based on Mel-Frequency Cepstral Coefficients (MFCCs) and spectrum features. For noisy utterances at 25 dB, the recognition rate with the proposed features reaches 80%, compared with 27% for MFCCs and 28% for spectrum features. For reverberant speech, the recognition rate reaches 80.67% with the proposed features, compared with 54% for MFCCs and 62.67% for spectrum features.
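The feature pipeline described above (spectrogram, then Radon projection, then 2-D DCT) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `radon_dct_features`, the STFT parameters, the number of projection angles, and the number of retained DCT coefficients are all assumptions, and the Radon transform is approximated here by rotating the spectrogram image and summing along one axis.

```python
import numpy as np
from scipy.signal import spectrogram
from scipy.ndimage import rotate
from scipy.fft import dctn

def radon_dct_features(x, fs, n_angles=60, n_coeffs=12):
    """Radon-projection + 2-D DCT features from a speech spectrogram (sketch)."""
    # Log-magnitude spectrogram of the utterance
    _, _, S = spectrogram(x, fs=fs, nperseg=256, noverlap=128)
    img = np.log(S + 1e-10)
    img = (img - img.mean()) / (img.std() + 1e-10)  # normalize the "image"
    # Radon projection: for each angle, rotate the image and sum along one axis
    angles = np.linspace(0.0, 180.0, n_angles, endpoint=False)
    sinogram = np.stack(
        [rotate(img, a, reshape=False, order=1).sum(axis=0) for a in angles],
        axis=1,
    )
    # 2-D DCT of the sinogram; the low-order coefficients form the feature vector
    coeffs = dctn(sinogram, norm='ortho')
    return coeffs[:n_coeffs, :n_coeffs].ravel()

# Demo on a synthetic noisy tone (1 s at 8 kHz)
fs = 8000
t = np.arange(fs) / fs
sig = np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.randn(fs)
feat = radon_dct_features(sig, fs)
print(feat.shape)  # → (144,)
```

A feature vector of this form would then be fed to the classifier (here, the LSTM-RNN) in place of MFCC or raw-spectrum features; the abstract's claim is that the Radon/DCT representation degrades less under noise and reverberation.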



Author information

Correspondence to Samia Abd El-Moneim.



Cite this article

El-Moneim, S.A., El-Mordy, E.A., Nassar, M.A. et al. Performance enhancement of text-independent speaker recognition in noisy and reverberation conditions using Radon transform with deep learning. Int J Speech Technol 25, 679–687 (2022). https://doi.org/10.1007/s10772-021-09880-6
