DNN and i-vector combined method for speaker recognition on multi-variability environments

Reyes-Díaz, Flavio J.; Hernández-Sierra, Gabriel; Calvo de Lara, José R.

doi:10.1007/s10772-021-09796-1

Published: 25 January 2021

Volume 24, pages 409–418, (2021)

International Journal of Speech Technology

Flavio J. Reyes-Díaz ORCID: orcid.org/0000-0003-3358-3188¹,
Gabriel Hernández-Sierra¹ &
José R. Calvo de Lara¹

Abstract

This article addresses variability compensation in Automatic Speaker Verification systems in scenarios where variability due to utterance duration, reverberation and environmental noise is present simultaneously. We introduce a new representation of the speaker’s discriminative information that combines a deep neural network, trained discriminatively for speaker classification, with the i-vector representation. Compared to baseline systems, the proposed representation improves verification performance, reducing the error by between 2.5 and 7.9 % across all variability conditions. We also analyze the robustness of the speaker verification system based on the interquartile range, obtaining a 1.19-fold improvement over the evaluated baselines.
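As a rough illustration of the robustness measure mentioned in the abstract, the sketch below computes the interquartile range (Q3 minus Q1) of error rates collected over several variability conditions and compares two systems by the ratio of their ranges. The error values, the function name `interquartile_range` and the choice of a ratio as the comparison are assumptions made for illustration only; the preview does not show how the article computes its 1.19 figure.

```python
from statistics import quantiles


def interquartile_range(values):
    """Return Q3 - Q1 for a sequence of error rates (inclusive quartile method)."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


# Hypothetical error rates (%) over five variability conditions.
# These numbers are illustrative and are NOT taken from the article.
baseline_eer = [4.0, 6.0, 8.0, 10.0, 12.0]
proposed_eer = [3.0, 4.0, 5.0, 6.0, 7.0]

iqr_base = interquartile_range(baseline_eer)
iqr_prop = interquartile_range(proposed_eer)

# A narrower range means the error varies less from one condition to another.
# One possible (assumed) summary is the ratio of the two ranges.
ratio = iqr_base / iqr_prop

print(iqr_base, iqr_prop, ratio)
```

A smaller interquartile range indicates that the error of a system changes less across conditions, which is one way to read robustness in a multi-variability evaluation.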

Notes

The UBM refers to a Universal Background Model of the population.
http://dnt.kr.hsnr.de/download.html.

Author information

Authors and Affiliations

Advanced Technologies Application Center (CENATAV), 7a.A # 21406 e/ 214 y 216, Playa, Havana, C.P. 12200, Cuba
Flavio J. Reyes-Díaz, Gabriel Hernández-Sierra & José R. Calvo de Lara

Corresponding author

Correspondence to Flavio J. Reyes-Díaz.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Reyes-Díaz, F.J., Hernández-Sierra, G. & de Lara, J.R.C. DNN and i-vector combined method for speaker recognition on multi-variability environments. Int J Speech Technol 24, 409–418 (2021). https://doi.org/10.1007/s10772-021-09796-1

Received: 24 June 2020
Accepted: 02 January 2021
Published: 25 January 2021
Issue Date: June 2021
DOI: https://doi.org/10.1007/s10772-021-09796-1
