A deep learning approach to integrate convolutional neural networks in speaker recognition

Published in: International Journal of Speech Technology

Abstract

We propose a novel use of convolutional neural networks (CNNs) for speaker recognition. Although CNNs were designed primarily for computer vision, they have recently been applied to speaker recognition by using spectrograms as input images. We argue that this approach is suboptimal, as it may compound two sources of error: one from solving a computer vision problem and one from solving a speaker recognition problem. In this work, we aim to integrate CNNs into speaker recognition without relying on images. We use restricted Boltzmann machines (RBMs) to extract speaker models as matrices and introduce a new way to model target and non-target speakers for speaker verification; a CNN then discriminates between target and non-target matrices. Experiments were conducted on the THUYG-20 SRE corpus under three noise conditions: clean, 9 dB, and 0 dB. The results show that our method outperforms state-of-the-art approaches, reducing the error rate by up to 60%.
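To make the core idea concrete, here is a minimal sketch of the model-extraction step: training a small RBM with one step of contrastive divergence (CD-1) on a speaker's feature frames and keeping the learned weight matrix as a fixed-size "speaker model" matrix that a downstream CNN could classify. The abstract does not specify the paper's exact RBM configuration, feature dimensions, or training schedule, so all hyperparameters and the use of mean-field reconstructions below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(frames, n_hidden=64, lr=0.01, epochs=10, seed=0):
    """Train a small RBM with CD-1 on one speaker's feature frames.

    Returns the learned weight matrix W, which serves here as a
    fixed-size speaker-model matrix (an illustrative stand-in for
    the matrices the paper feeds to its CNN).
    """
    rng = np.random.default_rng(seed)
    n_visible = frames.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_v = np.zeros(n_visible)
    b_h = np.zeros(n_hidden)
    for _ in range(epochs):
        # Positive phase: hidden activations driven by the data.
        v0 = frames
        h0_prob = sigmoid(v0 @ W + b_h)
        h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)
        # Negative phase: one reconstruction step (mean-field).
        v1 = sigmoid(h0 @ W.T + b_v)
        h1_prob = sigmoid(v1 @ W + b_h)
        # CD-1 parameter updates, averaged over the batch.
        W += lr * ((v0.T @ h0_prob) - (v1.T @ h1_prob)) / len(frames)
        b_v += lr * (v0 - v1).mean(axis=0)
        b_h += lr * (h0_prob - h1_prob).mean(axis=0)
    return W

# Toy usage: 200 pseudo feature frames of dimension 20
# (real MFCC frames would replace this random data).
frames = np.random.default_rng(1).random((200, 20))
model = train_rbm(frames, n_hidden=64)
print(model.shape)  # a 20 x 64 speaker-model matrix
```

In the verification setting described by the abstract, one such matrix would be built per enrollment/test pairing, and a CNN would be trained to label matrices as target or non-target.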



Author information

Correspondence to Soufiane Hourri.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Cite this article

Hourri, S., Nikolov, N.S. & Kharroubi, J. A deep learning approach to integrate convolutional neural networks in speaker recognition. Int J Speech Technol 23, 615–623 (2020). https://doi.org/10.1007/s10772-020-09718-7
