An Analytic Study on Clustering-Based Pseudo-labels for Self-supervised Deep Speaker Verification

Kang, Woo Hyun; Alam, Jahangir; Fathan, Abderrahim

doi:10.1007/978-3-031-20980-2_29

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13721)

Included in the following conference series:

International Conference on Speech and Computer

Abstract

One of the most widely used self-supervised methods for training a speaker verification system is to generate pseudo-labels with an unsupervised clustering algorithm and then train the speaker embedding network on those pseudo-labels in a discriminative fashion. Although this pseudo-label-based self-supervised speaker embedding extraction scheme has shown impressive performance, the pseudo-label generation process itself has received little exploration. In this paper, we conduct a set of experiments with several clustering algorithms to analyze the impact of different clustering configurations on the pseudo-label-based self-supervised training strategy for speaker verification. The experimental results show that the performance of the self-supervised speaker embedding system depends heavily on the accuracy of the pseudo-labels, and that performance can degrade severely when the network overfits to inaccurately generated pseudo-labels.
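
The pseudo-label generation step described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the authors' setup: the toy Gaussian "embeddings", the plain k-means routine with a deterministic farthest-point initialisation, the function names `kmeans_pseudo_labels` and `cluster_purity`, and the use of cluster purity as a stand-in for pseudo-label accuracy.

```python
import numpy as np


def kmeans_pseudo_labels(embeddings, n_clusters, n_iters=50):
    """Assign each embedding a cluster index to serve as its pseudo-label."""
    x = np.asarray(embeddings, dtype=float)
    # Deterministic farthest-point initialisation (illustrative choice, not from the paper).
    centroids = [x[0]]
    for _ in range(1, n_clusters):
        dist_to_nearest = np.min(
            [np.linalg.norm(x - c, axis=1) for c in centroids], axis=0
        )
        centroids.append(x[np.argmax(dist_to_nearest)])
    centroids = np.stack(centroids)

    for _ in range(n_iters):
        # Assignment step: nearest centroid for every embedding.
        dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Update step: mean of each cluster (keep the old centroid if a cluster is empty).
        new_centroids = np.stack([
            x[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(n_clusters)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels


def cluster_purity(pseudo_labels, true_labels):
    """Fraction of samples that belong to the majority speaker of their cluster."""
    pseudo_labels = np.asarray(pseudo_labels)
    true_labels = np.asarray(true_labels)
    correct = 0
    for k in np.unique(pseudo_labels):
        members = true_labels[pseudo_labels == k]
        correct += np.bincount(members).max()
    return correct / len(true_labels)


# Toy data: 4 hypothetical "speakers", 50 utterance embeddings each, 8 dimensions.
rng = np.random.default_rng(0)
n_speakers, per_speaker, dim = 4, 50, 8
centers = 10.0 * np.eye(n_speakers, dim)  # well-separated speaker centers
true_labels = np.repeat(np.arange(n_speakers), per_speaker)
noise = 0.5 * rng.standard_normal((n_speakers * per_speaker, dim))
embeddings = centers[true_labels] + noise

pseudo = kmeans_pseudo_labels(embeddings, n_clusters=n_speakers)
print("purity with 4 clusters:", cluster_purity(pseudo, true_labels))

# A mismatched cluster count merges speakers and lowers pseudo-label purity.
pseudo_under = kmeans_pseudo_labels(embeddings, n_clusters=2)
print("purity with 2 clusters:", cluster_purity(pseudo_under, true_labels))
```

In a pipeline of the kind the abstract describes, the integer array returned by `kmeans_pseudo_labels` would play the role of class targets for discriminative training of the embedding network. The second call only hints at how a clustering configuration can change label quality; it is not a reproduction of the paper's experiments.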

Notes

1. https://github.com/joonson/voxceleb_unsupervised

Acknowledgments

The authors wish to acknowledge the funding from the Government of Canada’s New Frontiers in Research Fund (NFRF) through grant NFRFR-2021-00338, and the Ministry of Economy and Innovation (MEI) of the Government of Quebec for its continued support.

Author information

Authors and Affiliations

Computer Research Institute of Montreal (CRIM), Montreal, QC, Canada
Woo Hyun Kang, Jahangir Alam & Abderrahim Fathan

Corresponding author

Correspondence to Jahangir Alam.

Editor information

Editors and Affiliations

Indian Institute of Technology Dharwad, Dharwad, India
S. R. Mahadeva Prasanna
St. Petersburg Federal Research Center of the Russian Academy of Sciences, St. Petersburg, Russia
Alexey Karpov
Koneru Lakshmaiah Education Foundation, Vaddeswaram, India
K. Samudravijaya
KIIT Group of Colleges, Gurugram, India
Shyam S. Agrawal

About this paper

Cite this paper

Kang, W.H., Alam, J., Fathan, A. (2022). An Analytic Study on Clustering-Based Pseudo-labels for Self-supervised Deep Speaker Verification. In: Prasanna, S.R.M., Karpov, A., Samudravijaya, K., Agrawal, S.S. (eds) Speech and Computer. SPECOM 2022. Lecture Notes in Computer Science, vol 13721. Springer, Cham. https://doi.org/10.1007/978-3-031-20980-2_29

DOI: https://doi.org/10.1007/978-3-031-20980-2_29
Published: 10 November 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20979-6
Online ISBN: 978-3-031-20980-2
eBook Packages: Computer Science; Computer Science (R0)
