Weighted X-Vectors for Robust Text-Independent Speaker Verification with Multiple Enrollment Utterances

Published in: Circuits, Systems, and Signal Processing

Abstract

Speech is a user-friendly signal for identity recognition with low computational complexity and implementation cost. However, using speech samples to identify persons has several limitations, such as degraded performance in real environments due to various noises and channel effects. In recent years, deep neural network (DNN)-based approaches have provided good results in speaker verification and have outperformed i-vector-based methods. The x-vector is a DNN-based speaker embedding that, in combination with probabilistic linear discriminant analysis (PLDA), increases both the accuracy and the robustness of speaker verification systems. In this paper, we propose weighted x-vectors as a method for enhancing speaker verification in both clean and noisy environments. The method exploits the statistical properties of the target speaker's enrollment x-vectors to weight the test x-vector, improving the scoring accuracy and thus the whole verification system. Experiments were conducted using the VoxCeleb dataset, MFCC feature vectors, and the PLDA scoring method. VoxCeleb is a large-scale dataset that contains real-world, short-duration speech samples from over 6,000 speakers. Multicondition training for LDA and PLDA was also employed to improve the system's performance under mismatched noisy conditions. The findings showed that using weighted x-vectors led to 18% and 10% reductions in equal error rate (EER) for clean and noisy conditions, respectively. The experiments also show that increasing the number of enrollment x-vectors further improves the performance of the proposed method.
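The core idea stated in the abstract, using statistics of a speaker's enrollment x-vectors to weight the test x-vector before scoring, can be sketched as follows. This is an illustrative interpretation only, not the paper's formula: the inverse-standard-deviation weighting, the toy 64-dimensional embeddings, and the use of cosine similarity in place of PLDA scoring are all assumptions made for the sketch.

```python
import numpy as np

def length_normalize(x):
    """Project an embedding onto the unit hypersphere (a common x-vector preprocessing step)."""
    return x / np.linalg.norm(x)

def weighted_test_xvector(test_x, enroll_xs):
    """Weight a test x-vector using statistics of the enrollment x-vectors.

    Hypothetical weighting scheme: dimensions that vary little across the
    target speaker's enrollment utterances are treated as more
    speaker-characteristic and are up-weighted. The paper defines its own
    weighting; this function is only a sketch of the general idea.
    """
    enroll_xs = np.asarray(enroll_xs)
    mean = enroll_xs.mean(axis=0)        # enrollment centroid
    std = enroll_xs.std(axis=0) + 1e-8   # per-dimension spread across enrollments
    w = 1.0 / std
    w *= w.size / w.sum()                # normalize weights to have mean 1
    return length_normalize(w * test_x), length_normalize(mean)

def cosine_score(a, b):
    """Cosine similarity, standing in here for the PLDA scoring used in the paper."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

With several enrollment utterances per target speaker, the per-dimension statistics become more reliable, which is consistent with the abstract's finding that more enrollment x-vectors improve the method's performance.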

Data availability

In this study, four audio datasets were used: VoxCeleb, NOISEX-92, PNL 100, and Freesound.org. The VoxCeleb dataset is available for download for commercial/research purposes under a Creative Commons Attribution 4.0 International License; the complete license and the audio files are available at https://www.robots.ox.ac.uk/~vgg/data/voxceleb/. Examples of the NOISEX database used in this study and its license information are available at http://spib.linse.ufsc.br/noise.html and http://www.speech.cs.cmu.edu/comp.speech/Section1/Data/noisex.html. The PNL 100 corpus, collected by Guoning Hu during his dissertation study, is available for download without restriction at http://web.cse.ohio-state.edu/pnl/corpus/HuNonspeech/HuCorpus.html. Freesound is a collaborative database of Creative Commons licensed sounds: https://freesound.org/.

Notes

  1. https://github.com/kaldi-asr/kaldi/tree/master/egs/voxceleb/v2

  2. https://kaldi-asr.org/models/m7


Author information

Corresponding author

Correspondence to Hamid Reza Sadegh Mohammadi.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Mohammadi, M., Sadegh Mohammadi, H.R. Weighted X-Vectors for Robust Text-Independent Speaker Verification with Multiple Enrollment Utterances. Circuits Syst Signal Process 41, 2825–2844 (2022). https://doi.org/10.1007/s00034-021-01915-2

