
A deep learning approach for speaker recognition

Published in: International Journal of Speech Technology

Abstract

Speaker verification (SV) is an important branch of speaker recognition, and several approaches to it have been investigated over the last few decades. In this context, deep learning has attracted growing interest from speech processing researchers and was recently introduced into speaker recognition. In most cases, deep learning models are adapted from speech recognition applications and applied to speaker recognition, where they have proven competitive with state-of-the-art approaches. Nevertheless, the use of deep learning in speaker recognition remains tied to speech recognition. In this study, we propose a new way to use deep neural networks (DNNs) in speaker recognition, with the aim of making it easier for the DNN to learn the distribution of the features. We are motivated by our previous work, in which we proposed a novel scoring method that performs well on clean speech but needs improvement under noisy conditions. For this reason, we aim to transform the extracted feature vectors (MFCCs) into enhanced feature vectors, which we call Deep Speaker Features (DeepSFs). Experiments conducted on the THUYG-20 SRE corpus yielded significant results: the new method outperforms both the i-vector/PLDA system and our baseline system, in both clean and noisy conditions.
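The abstract describes mapping raw MFCC vectors through a DNN into enhanced Deep Speaker Features, which are then compared with a distance-based score. A minimal sketch of that pipeline is below; the layer sizes, the untrained random weights, and the cosine scoring rule are all illustrative assumptions and do not reproduce the paper's actual architecture, training objective, or scoring method.

```python
import numpy as np

rng = np.random.default_rng(0)


def relu(x):
    return np.maximum(0.0, x)


class FeatureEnhancer:
    """Toy feed-forward network mapping MFCC vectors to 'enhanced' vectors.

    The layer sizes and random (untrained) weights are illustrative
    assumptions, not the paper's DNN.
    """

    def __init__(self, n_mfcc=20, n_hidden=64, n_out=20):
        self.w1 = rng.standard_normal((n_mfcc, n_hidden)) * 0.1
        self.b1 = np.zeros(n_hidden)
        self.w2 = rng.standard_normal((n_hidden, n_out)) * 0.1
        self.b2 = np.zeros(n_out)

    def transform(self, mfcc):
        """mfcc: (n_frames, n_mfcc) -> enhanced features (n_frames, n_out)."""
        hidden = relu(mfcc @ self.w1 + self.b1)
        return hidden @ self.w2 + self.b2


def cosine_score(enrol, test):
    """Distance-style verification score: cosine similarity of the two
    utterances' mean feature vectors (a stand-in for the paper's scoring)."""
    a, b = enrol.mean(axis=0), test.mean(axis=0)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Stand-in MFCC matrices for two utterances (frames x coefficients).
enrol_mfcc = rng.standard_normal((120, 20))
test_mfcc = rng.standard_normal((80, 20))

enhancer = FeatureEnhancer()
score = cosine_score(enhancer.transform(enrol_mfcc),
                     enhancer.transform(test_mfcc))
print(score)  # in [-1, 1]; accept the trial if it exceeds a tuned threshold
```

In a real system the network would be trained so that vectors from the same speaker cluster together, and the accept/reject threshold would be tuned on development data.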


Fig. 1
Fig. 2


References

  • Ai, O. C., Hariharan, M., Yaacob, S., & Chee, L. S. (2012). Classification of speech dysfluencies with MFCC and LPCC features. Expert Systems with Applications, 39(2), 2157–2165.


  • Bahdanau, D., Chorowski, J., Serdyuk, D., Brakel, P., & Bengio, Y. (2016). End-to-end attention-based large vocabulary speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4945–4949). IEEE.

  • Beigi, H. (2011). Fundamentals of speaker recognition (1st ed.). New York: Springer. https://doi.org/10.1007/978-0-387-77592-0.


  • Bouziane, A., Kadi, H., Hourri, S., & Kharroubi, J. (2016). An open and free speech corpus for speaker recognition: The FSCSR speech corpus. In 2016 11th International Conference on Intelligent Systems: Theories and Applications (SITA) (pp. 1–5). IEEE.

  • Cochran, W. T., Cooley, J. W., Favin, D. L., Helms, H. D., Kaenel, R. A., Lang, W. W., et al. (1967). What is the fast Fourier transform? Proceedings of the IEEE, 55(10), 1664–1674.


  • Deng, L. (2014). A tutorial survey of architectures, algorithms, and applications for deep learning. APSIPA Transactions on Signal and Information Processing, 3.

  • Dong, C., Loy, C. C., He, K., & Tang, X. (2016). Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2), 295–307.


  • Forsyth, M. E., Sutherland, A. M., Elliott, J., & Jack, M. A. (1993). HMM speaker verification with sparse training data on telephone quality speech. Speech Communication, 13(3–4), 411–416.


  • Hanilçi, C. (2018). Data selection for i-vector based automatic speaker verification anti-spoofing. Digital Signal Processing, 72, 171–180.


  • Hasan, M. R., Jamil, M., Rahman, M., et al. (2004). Speaker identification using Mel frequency cepstral coefficients. Variations, 1(4).

  • Hermansky, H. (1990). Perceptual linear predictive (PLP) analysis of speech. The Journal of the Acoustical Society of America, 87(4), 1738–1752.


  • Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A.-r., Jaitly, N., et al. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6), 82–97.


  • Hinton, G. E., Osindero, S., & Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18(7), 1527–1554.


  • Hourri, S., & Kharroubi, J. (2019). A novel scoring method based on distance calculation for similarity measurement in text-independent speaker verification. Procedia Computer Science, 148, 256–265.


  • Kabal, P., & Ramachandran, R. P. (1986). The computation of line spectral frequencies using Chebyshev polynomials. IEEE Transactions on Acoustics, Speech, and Signal Processing, 34(6), 1419–1426.


  • Karpathy, A., & Fei-Fei, L. (2015). Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3128–3137).

  • Kenny, P., Gupta, V., Stafylakis, T., Ouellet, P., & Alam, J. (2014). Deep neural networks for extracting Baum-Welch statistics for speaker recognition. In Proc. Odyssey (pp. 293–298).

  • Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

  • Kinnunen, T., & Li, H. (2010). An overview of text-independent speaker recognition: From features to supervectors. Speech Communication, 52(1), 12–40.


  • Lee, K. F., & Hon, H. W. (1988). Large-vocabulary speaker-independent continuous speech recognition using HMM. In 1988 International Conference on Acoustics, Speech, and Signal Processing (ICASSP-88) (pp. 123–126). IEEE.

  • Lei, Y., Scheffer, N., Ferrer, L., & McLaren, M. (2014). A novel scheme for speaker recognition using a phonetically-aware deep neural network. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1695–1699). IEEE.

  • Liu, Y., Qian, Y., Chen, N., Fu, T., Zhang, Y., & Yu, K. (2015). Deep feature for text-dependent speaker verification. Speech Communication, 73, 1–13.


  • Martinez, J., Perez, H., Escamilla, E., & Suzuki, M. M. (2012). Speaker recognition using Mel frequency cepstral coefficients (MFCC) and vector quantization (VQ) techniques. In 2012 22nd International Conference on Electrical Communications and Computers (CONIELECOMP) (pp. 248–251). IEEE.

  • McLaren, M., Lei, Y., & Ferrer, L. (2015). Advances in deep neural network approaches to speaker recognition. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4814–4818). IEEE.

  • Mohamed, A., Dahl, G. E., Hinton, G., et al. (2012). Acoustic modeling using deep belief networks. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 14–22.


  • Molau, S., Pitz, M., Schluter, R., & Ney, H. (2001). Computing Mel-frequency cepstral coefficients on the power spectrum. In 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '01) (Vol. 1, pp. 73–76). IEEE.

  • Qawaqneh, Z., Mallouh, A. A., & Barkana, B. D. (2017). Deep neural network framework and transformed MFCCs for speaker's age and gender classification. Knowledge-Based Systems, 115, 5–14.


  • Reynolds, D. A., Quatieri, T. F., & Dunn, R. B. (2000). Speaker verification using adapted Gaussian mixture models. Digital Signal Processing, 10(1–3), 19–41.


  • Richardson, F., Reynolds, D., & Dehak, N. (2015). Deep neural network approaches to speaker and language recognition. IEEE Signal Processing Letters, 22(10), 1671–1675.


  • Rozi, A., Wang, D., Zhang, Z., & Zheng, T. F. (2015). An open/free database and benchmark for Uyghur speaker recognition. In 2015 International Conference Oriental COCOSDA held jointly with Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE) (pp. 81–85). IEEE.

  • Senoussaoui, M., Dehak, N., Kenny, P., Dehak, R., & Dumouchel, P. (2012). First attempt of Boltzmann machines for speaker verification. In Odyssey 2012: The Speaker and Language Recognition Workshop.

  • Shahin, I., & Botros, N. (1998). Speaker identification using dynamic time warping with stress compensation technique. In Proceedings of IEEE Southeastcon '98 (pp. 65–68). IEEE.

  • Singh, S., & Rajan, E. (2011). Vector quantization approach for speaker recognition using MFCC and inverted MFCC. International Journal of Computer Applications, 17(1), 1–7.


  • Soong, F. K., Rosenberg, A. E., Juang, B. H., & Rabiner, L. R. (1987). Report: A vector quantization approach to speaker recognition. AT&T Technical Journal, 66(2), 14–26.


  • Tirumala, S. S., & Shahamiri, S. R. (2016). A review on deep learning approaches in speaker identification. In Proceedings of the 8th International Conference on Signal Processing Systems (pp. 142–147). ACM.

  • Vasilakakis, V., Cumani, S., & Laface, P. (2013). Speaker recognition by means of deep belief networks. In Proceedings of Biometric Technologies in Forensic Science.

  • Yujin, Y., Peihua, Z., & Qun, Z. (2010). Research of speaker recognition based on combination of LPCC and MFCC. In 2010 IEEE International Conference on Intelligent Computing and Intelligent Systems (ICIS) (Vol. 3, pp. 765–767). IEEE.

  • Zhang, C., Yu, C., & Hansen, J. H. (2017). An investigation of deep-learning frameworks for speaker verification antispoofing. IEEE Journal of Selected Topics in Signal Processing, 11(4), 684–694.



Author information


Corresponding author

Correspondence to Soufiane Hourri.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Hourri, S., Kharroubi, J. A deep learning approach for speaker recognition. Int J Speech Technol 23, 123–131 (2020). https://doi.org/10.1007/s10772-019-09665-y
