Abstract:
Domain shift poses a significant challenge in speaker verification, especially in open-set scenarios where the speaker categories are disjoint between the source and target domains. To alleviate domain shift, traditional domain adaptation methods typically align the source and target distributions in the speaker embedding space, but this may cause embeddings from different speakers to overlap. To address this problem, this paper proposes to perform the domain alignment in a novel distance metric space, where the source and target domains share the same two categories: within-speaker and between-speaker. Thus, the discrepancy between the source and target domains arises only from the domain shift. We refer to the proposed method as Cross-Domain Distance Metric Adaptation (CDMA), in which the within- and between-speaker distance distributions in the target domain are aligned with the corresponding source distance distributions and further separated to minimize their overlap. This alignment and separation require estimating the within- and between-speaker distance distributions from speaker labels, which are unavailable in the unlabeled target domain. Thus, we further propose a learnable speaker clustering method called Graph Convolutional Network with Graph Pruning (GCN-GP). This method generates high-quality pseudo-labels, which are used to estimate the two distance distributions in the target domain. Experimental results demonstrate that our method achieves state-of-the-art performance on the FFSVC2022 and VOiCES datasets.
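As a rough illustration of the distance metric space described above, the following sketch splits pairwise embedding distances into within-speaker and between-speaker sets using (pseudo-)labels, then compares source and target statistics. It is not the CDMA objective or the GCN-GP clustering from the paper. The function names, the use of cosine distance, the toy embeddings, and the mean-gap statistic are all assumptions made for illustration.

```python
import numpy as np

def pairwise_cosine_distance(emb):
    """Cosine distance matrix for row-vector embeddings (assumed metric)."""
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return 1.0 - normed @ normed.T

def split_distances(emb, labels):
    """Return (within_speaker, between_speaker) distances over unique pairs."""
    labels = np.asarray(labels)
    dist = pairwise_cosine_distance(np.asarray(emb, dtype=float))
    rows, cols = np.triu_indices(len(labels), k=1)  # each pair counted once
    same = labels[rows] == labels[cols]
    pair_dist = dist[rows, cols]
    return pair_dist[same], pair_dist[~same]

def mean_gap(source_dist, target_dist):
    """Hypothetical stand-in for a distribution discrepancy: gap between means."""
    return abs(source_dist.mean() - target_dist.mean())

# Toy data: two speakers per domain. Target labels play the role of pseudo-labels.
source_emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
target_emb = [[1.0, 0.2], [0.8, 0.4], [0.2, 1.0], [0.4, 0.8]]
labels = [0, 0, 1, 1]

src_within, src_between = split_distances(source_emb, labels)
tgt_within, tgt_between = split_distances(target_emb, labels)

print("within-speaker mean gap: ", mean_gap(src_within, tgt_within))
print("between-speaker mean gap:", mean_gap(src_between, tgt_between))
```

Because both domains are reduced to the same two categories of distances, the source and target statistics can be compared directly even though their speaker sets are disjoint.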
Published in: IEEE/ACM Transactions on Audio, Speech, and Language Processing (Volume: 32)