Abstract
Speech is a user-friendly biometric signal for identity recognition with low computational complexity and implementation cost. However, using speech samples to identify persons has several limitations, such as degraded performance in real environments caused by various noises and channel effects. In recent years, deep neural network (DNN)-based approaches have achieved good results in speaker verification and have outperformed i-vector-based methods. The x-vector is a DNN-based speaker embedding that, in combination with probabilistic linear discriminant analysis (PLDA), increases both the accuracy and robustness of speaker verification systems. In this paper, we propose weighted x-vectors as a method for enhancing speaker verification in both clean and noisy environments. The method exploits the statistical properties of the target speaker's enrollment x-vectors to weight the test x-vector, thereby improving the scoring accuracy and hence the whole verification system. Experiments were conducted using the VoxCeleb dataset, MFCC feature vectors, and PLDA scoring. VoxCeleb is a large-scale dataset containing real-world, short-duration speech samples from over 6,000 speakers. Multicondition training of LDA and PLDA was also employed to improve the system's performance under mismatched noisy conditions. The results show that weighted x-vectors reduce the equal error rate (EER) by 18% and 10% in clean and noisy conditions, respectively. The experiments also show that increasing the number of enrollment x-vectors further improves the performance of the proposed method.
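The core idea of the abstract — using statistics of a speaker's multiple enrollment x-vectors to weight the test x-vector before scoring — can be illustrated with a minimal NumPy sketch. The inverse-variance weighting below is a hypothetical stand-in, since the abstract does not specify the exact weighting scheme, and `weight_test_xvector` and `cosine_score` are illustrative names; a full system would score with PLDA rather than cosine similarity.

```python
import numpy as np

def weight_test_xvector(enroll_xvecs, test_xvec, eps=1e-8):
    """Weight a test x-vector using enrollment statistics.

    Hypothetical illustration: dimensions where the enrollment
    x-vectors agree (low variance across enrollments) are emphasized,
    while unstable dimensions are de-emphasized.
    """
    enroll_xvecs = np.asarray(enroll_xvecs, dtype=float)
    var = enroll_xvecs.var(axis=0)        # per-dimension spread over enrollments
    w = 1.0 / (var + eps)                 # inverse-variance weights
    w /= w.sum()                          # normalize weights to sum to 1
    return w * test_xvec * test_xvec.size # rescale to preserve overall magnitude

def cosine_score(a, b):
    """Cosine similarity used here as a simple scoring placeholder for PLDA."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Three enrollment x-vectors (toy 3-dimensional example) and one test x-vector.
enroll = np.array([[1.0, 2.0, 3.0],
                   [1.2, 2.2, 2.8],
                   [0.9, 1.8, 3.1]])
test_x = np.array([1.0, 2.0, 3.0])
weighted = weight_test_xvector(enroll, test_x)
score = cosine_score(weighted, enroll.mean(axis=0))
```

In this sketch, adding more enrollment x-vectors sharpens the per-dimension variance estimate, which is consistent with the abstract's observation that performance improves as the number of enrollment x-vectors grows.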




Data availability
In this study, four audio datasets were used: VoxCeleb, NOISEX-92, PNL 100, and Freesound.org. The VoxCeleb dataset is available for download for commercial/research purposes under a Creative Commons Attribution 4.0 International License; the full license text and the audio files are available at https://www.robots.ox.ac.uk/~vgg/data/voxceleb/. Examples of the NOISEX database used in this study and its license information are available at http://spib.linse.ufsc.br/noise.html and http://www.speech.cs.cmu.edu/comp.speech/Section1/Data/noisex.html. The PNL 100 corpus, collected by Guoning Hu during his dissertation research, is available for download without restriction at http://web.cse.ohio-state.edu/pnl/corpus/HuNonspeech/HuCorpus.html. Freesound is a collaborative database of Creative Commons licensed sounds: https://freesound.org/.
Ethics declarations
Competing interests
The authors declare that they have no competing financial interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Mohammadi, M., Sadegh Mohammadi, H.R. Weighted X-Vectors for Robust Text-Independent Speaker Verification with Multiple Enrollment Utterances. Circuits Syst Signal Process 41, 2825–2844 (2022). https://doi.org/10.1007/s00034-021-01915-2