Abstract
One of the most widely used self-supervised methods for training a speaker verification system is to generate pseudo-labels with an unsupervised clustering algorithm and then train the speaker embedding network on those pseudo-labels in a discriminative fashion. Although this pseudo-label-based self-supervised speaker embedding extraction scheme has shown impressive performance, the pseudo-label generation process itself has received little attention. In this paper, we conduct a set of experiments with several clustering algorithms to analyze how different clustering configurations affect the pseudo-label-based self-supervised training strategy for speaker verification. The experimental results show that the performance of the self-supervised speaker embedding system depends heavily on the accuracy of the pseudo-labels, and that performance can degrade severely when the network overfits to inaccurately generated pseudo-labels.
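To make the pseudo-label generation step concrete, the following is a minimal sketch, not the authors' pipeline. It assumes k-means as a stand-in clustering algorithm, toy 2-D points in place of real speaker embeddings, and cluster purity as an illustrative measure of pseudo-label accuracy; the abstract does not state which algorithms or metrics were actually used. All function names (`kmeans_pseudo_labels`, `cluster_purity`, `farthest_point_init`) are hypothetical.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(points, k):
    """Deterministic seeding (an assumption made for reproducibility):
    start from the first point, then repeatedly add the point that is
    farthest from the centroids chosen so far."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids),
        )
        centroids.append(list(farthest))
    return centroids


def kmeans_pseudo_labels(points, k, max_iters=50):
    """Run Lloyd's k-means and return one cluster index per point.
    The cluster indices play the role of pseudo speaker labels."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iters):
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


def cluster_purity(pseudo_labels, true_labels):
    """Fraction of points whose true label is the majority label
    of the cluster they were assigned to."""
    by_cluster = {}
    for pseudo, true in zip(pseudo_labels, true_labels):
        by_cluster.setdefault(pseudo, Counter())[true] += 1
    majority = sum(c.most_common(1)[0][1] for c in by_cluster.values())
    return majority / len(true_labels)


# Toy "embeddings": three well-separated speakers, three utterances each.
embeddings = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),     # speaker A
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),     # speaker B
    (10.0, 0.0), (10.1, 0.0), (10.0, 0.1),  # speaker C
]
true_speakers = ["A"] * 3 + ["B"] * 3 + ["C"] * 3

# Matching the number of clusters to the number of speakers.
labels_k3 = kmeans_pseudo_labels(embeddings, k=3)
purity_k3 = cluster_purity(labels_k3, true_speakers)

# Too few clusters: at least two speakers must share a pseudo-label.
labels_k2 = kmeans_pseudo_labels(embeddings, k=2)
purity_k2 = cluster_purity(labels_k2, true_speakers)

print("purity with k=3:", purity_k3)
print("purity with k=2:", purity_k2)
```

In this toy setting, a mismatched cluster count merges distinct speakers under one pseudo-label, which is the kind of label inaccuracy a discriminatively trained embedding network could then overfit to. In practice the cluster indices would be passed to a classification loss as training targets, and the true labels would be unavailable during training and used only for analysis.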
Acknowledgments
The authors wish to acknowledge the funding from the Government of Canada’s New Frontiers in Research Fund (NFRF) through grant NFRFR-2021-00338 and the Ministry of Economy and Innovation (MEI) of the Government of Quebec for its continued support.
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Kang, W.H., Alam, J., Fathan, A. (2022). An Analytic Study on Clustering-Based Pseudo-labels for Self-supervised Deep Speaker Verification. In: Prasanna, S.R.M., Karpov, A., Samudravijaya, K., Agrawal, S.S. (eds) Speech and Computer. SPECOM 2022. Lecture Notes in Computer Science, vol 13721. Springer, Cham. https://doi.org/10.1007/978-3-031-20980-2_29
DOI: https://doi.org/10.1007/978-3-031-20980-2_29
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20979-6
Online ISBN: 978-3-031-20980-2
eBook Packages: Computer Science (R0)