Abstract:
Self-supervised learning has shown promising performance on speaker verification tasks, and Self-DIstillation with NO labels (DINO) is currently a widely adopted framework for it. DINO is an unsupervised deep clustering method, but its number of valid prototypes is far smaller than the number of speakers in practical applications and remains unchanged throughout training, leading to severe speaker confusion and performance degradation. Therefore, a strategy named prototype division (PD) is proposed: starting from the converged model, it iteratively generates fine-grained prototypes in the projection space to separate confused categories, deriving the new prototypes from the neighborhood of the existing valid prototypes by clustering or sampling. The results on Vox1-O show a significant improvement, a relative gain of 31.1% over the baseline, without any auxiliary loss. Further experiments on CN-Celeb also show stable improvement, demonstrating the consistency of the proposed method.
Published in: IEEE Signal Processing Letters (Volume: 31)
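
The sampling variant of prototype division described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the neighborhood is defined or how valid prototypes are selected, so everything below is an assumption for illustration only: the function names `l2_normalize`, `divide_prototypes`, and `cosine`, the parameters `children_per_parent` and `noise_scale`, the use of an isotropic Gaussian perturbation as the "neighborhood", and the unit-length normalization of prototypes. It is not the authors' implementation.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def divide_prototypes(prototypes, valid_ids, children_per_parent=2,
                      noise_scale=0.05, seed=0):
    """Sampling-based division: draw new prototypes near each valid one.

    prototypes: list of prototype vectors in the projection space.
    valid_ids: indices of prototypes treated as "valid". How validity is
        decided is NOT given in the abstract; the caller supplies it here.
    Returns a list of (parent_id, new_prototype) pairs.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    children = []
    for pid in valid_ids:
        parent = l2_normalize(prototypes[pid])
        for _ in range(children_per_parent):
            # Assumed neighborhood: small isotropic Gaussian perturbation.
            perturbed = [x + rng.gauss(0.0, noise_scale) for x in parent]
            children.append((pid, l2_normalize(perturbed)))
    return children


def cosine(a, b):
    """Cosine similarity for vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))


# Toy example: three prototypes, of which only 0 and 2 are treated as valid.
prototypes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
children = divide_prototypes(prototypes, valid_ids=[0, 2])
for parent_id, child in children:
    print(parent_id, round(cosine(prototypes[parent_id], child), 3))
```

In this toy setup each valid prototype yields two nearby children while the unused prototype yields none, which mirrors the idea of refining only the occupied regions of the projection space. A clustering variant would instead derive the children from embeddings assigned to each valid prototype.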