Weighted X-Vectors for Robust Text-Independent Speaker Verification with Multiple Enrollment Utterances

Published in: Circuits, Systems, and Signal Processing

Abstract

Speech is a user-friendly signal for identity recognition with low computational complexity and implementation cost. However, using speech samples to identify persons has several limitations, such as degraded performance in real environments due to various noises and channel effects. In recent years, deep neural network (DNN)-based approaches have provided good results in speaker verification and have outperformed i-vector-based methods. The x-vector is a DNN-based speaker embedding that, in combination with probabilistic linear discriminant analysis (PLDA), increases both the accuracy and the robustness of speaker verification systems. In this paper, we propose weighted x-vectors as a method for enhancing speaker verification in both clean and noisy environments. The method exploits the statistical properties of the target speaker's enrollment x-vectors to weight the test x-vector, improving the scoring accuracy and thus the whole verification system. Experiments were conducted using the VoxCeleb dataset, MFCC feature vectors, and the PLDA scoring method. VoxCeleb is a large-scale dataset that contains real-world, short-duration speech samples from over 6,000 speakers. Multicondition training for LDA and PLDA was also employed to improve the system's performance under mismatched noisy conditions. The findings showed that using weighted x-vectors led to 18% and 10% reductions in equal error rate (EER) for clean and noisy conditions, respectively. The experiments also show that increasing the number of enrollment x-vectors further improves the performance of the proposed method.
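The core idea stated in the abstract, using statistics of a speaker's enrollment x-vectors to weight the test x-vector before scoring, can be sketched as follows. This is an illustrative interpretation only, not the paper's formula: the inverse-standard-deviation weighting, the toy 64-dimensional embeddings, and the use of cosine similarity in place of PLDA scoring are all assumptions made for the sketch.

```python
import numpy as np

def length_normalize(x):
    """Project an embedding onto the unit hypersphere (a common x-vector preprocessing step)."""
    return x / np.linalg.norm(x)

def weighted_test_xvector(test_x, enroll_xs):
    """Weight a test x-vector using statistics of the enrollment x-vectors.

    Hypothetical weighting scheme: dimensions that vary little across the
    target speaker's enrollment utterances are treated as more
    speaker-characteristic and are up-weighted. The paper defines its own
    weighting; this function is only a sketch of the general idea.
    """
    enroll_xs = np.asarray(enroll_xs)
    mean = enroll_xs.mean(axis=0)        # enrollment centroid
    std = enroll_xs.std(axis=0) + 1e-8   # per-dimension spread across enrollments
    w = 1.0 / std
    w *= w.size / w.sum()                # normalize weights to have mean 1
    return length_normalize(w * test_x), length_normalize(mean)

def cosine_score(a, b):
    """Cosine similarity, standing in here for the PLDA scoring used in the paper."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

With several enrollment utterances per target speaker, the per-dimension statistics become more reliable, which is consistent with the abstract's finding that more enrollment x-vectors improve the method's performance.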

Data availability

In this study, four audio datasets were used: VoxCeleb, NOISEX-92, PNL 100, and Freesound.org. The VoxCeleb dataset is available for download for commercial/research purposes under a Creative Commons Attribution 4.0 International License; the complete license and the audio files are available at https://www.robots.ox.ac.uk/~vgg/data/voxceleb/. Examples of the NOISEX database used in this study and its license information are available at http://spib.linse.ufsc.br/noise.html and http://www.speech.cs.cmu.edu/comp.speech/Section1/Data/noisex.html. The PNL 100 corpus, collected by Guoning Hu during his dissertation study, is available for download without restriction at http://web.cse.ohio-state.edu/pnl/corpus/HuNonspeech/HuCorpus.html. Freesound is a collaborative database of Creative Commons licensed sounds: https://freesound.org/.

Notes

  1. https://github.com/kaldi-asr/kaldi/tree/master/egs/voxceleb/v2

  2. https://kaldi-asr.org/models/m7


Author information

Corresponding author

Correspondence to Hamid Reza Sadegh Mohammadi.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Mohammadi, M., Sadegh Mohammadi, H.R. Weighted X-Vectors for Robust Text-Independent Speaker Verification with Multiple Enrollment Utterances. Circuits Syst Signal Process 41, 2825–2844 (2022). https://doi.org/10.1007/s00034-021-01915-2

