Abstract:
One of the emerging challenges in automatic speaker recognition is the development of systems that are robust to noisy and far-field conditions. The current standard for x-vector speaker embedding is based on a time-delay neural network (TDNN). In the presence of these conditions, it is less robust than systems based on a residual network (ResNet) and than other baseline systems that use signal-enhancement preprocessing. In this study, we improve the performance of TDNN-based embedding by integrating a residual block with a time-restricted self-attention option (AttResBlock) at the frame level of the TDNN. We evaluate the proposed speaker embedding extractor (AttResBlock-TDNN) in experiments on the Voices Obscured in Complex Environmental Settings (VOiCES) corpus. The experimental results show that AttResBlock-TDNN outperforms state-of-the-art systems under many adverse conditions. For instance, relative to the original TDNN-based encoder, the proposed AttResBlock-TDNN improves the minimum detection cost function (minDCF) by 11.4% and the equal error rate (EER) by 15.5%.
Date of Conference: 18-20 September 2022
Date Added to IEEE Xplore: 17 October 2022