Abstract
Speaker recognition is one of several biometric recognition techniques, and it is of high importance in numerous security and telecommunications applications. The main goal of a speaker recognition system is to identify who is speaking from voice characteristics. This paper presents an extensive study of speaker recognition in both text-dependent and text-independent cases. Convolutional Neural Network (CNN) based feature extraction is extended to both the text-dependent and text-independent speaker recognition tasks. In addition, the effect of reverberation on the speaker recognition system is addressed. All speech signals are converted into images by obtaining their spectrograms. Two CNN models are proposed for efficient speaker recognition from clean and reverberant speech signals. They rely on image processing concepts applied to the spectrograms of speech signals. One of the proposed models is compared with a conventional benchmark model in the text-independent scenario. The performance of the recognition system is measured by the recognition rate for both clean and reverberant speech.
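The abstract does not specify how the spectrogram images or the reverberant signals are produced, so the following is only a minimal sketch of the general idea, not the authors' pipeline. It assumes a Hann-windowed short-time Fourier transform with a hypothetical frame length of 256 samples and hop of 128, and it simulates reverberation with a synthetic, exponentially decaying noise impulse response whose decay time (`rt60`) is an illustrative parameter. The function names `add_reverb`, `log_spectrogram`, and `to_image` are made up for this example.

```python
import numpy as np

def add_reverb(signal, sample_rate, rt60=0.5, seed=0):
    """Simulate reverberation with a synthetic impulse response (assumed model)."""
    rng = np.random.default_rng(seed)
    n = int(rt60 * sample_rate)
    t = np.arange(n) / sample_rate
    # Noise whose amplitude envelope falls by 60 dB at t = rt60.
    rir = rng.standard_normal(n) * 10.0 ** (-3.0 * t / rt60)
    rir[0] = 1.0  # direct path
    wet = np.convolve(signal, rir)[: len(signal)]
    return wet / np.max(np.abs(wet))

def log_spectrogram(signal, frame_len=256, hop=128):
    """Return a log-power spectrogram of shape (frequency bins, frames)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return 10.0 * np.log10(power + 1e-10).T

def to_image(spec):
    """Rescale a spectrogram to an 8-bit grayscale image."""
    lo, hi = spec.min(), spec.max()
    return (255 * (spec - lo) / (hi - lo + 1e-12)).astype(np.uint8)

# Stand-in for a speech recording: one second of a 440 Hz tone at 8 kHz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
clean = np.sin(2 * np.pi * 440 * t)
reverberant = add_reverb(clean, sample_rate)

clean_img = to_image(log_spectrogram(clean))
reverb_img = to_image(log_spectrogram(reverberant))
print(clean_img.shape, reverb_img.shape)
```

Images produced this way could then be fed to a CNN classifier in the same manner as any other grayscale image. With real recordings, a measured room impulse response would be a more faithful substitute for the synthetic one used here.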
About this article
Cite this article
El-Moneim, S.A., Sedik, A., Nassar, M.A. et al. Text-dependent and text-independent speaker recognition of reverberant speech based on CNN. Int J Speech Technol 24, 993–1006 (2021). https://doi.org/10.1007/s10772-021-09805-3
DOI: https://doi.org/10.1007/s10772-021-09805-3