Abstract:
This paper proposes a speaker characterization approach that combines four data augmentation methods with a speaker embedding network built from time delay neural network and long short-term memory layers (TDNN-LSTM). The data augmentation increases the amount and diversity of the training data through speed perturbation, volume perturbation, room impulse responses, and additive noise. The TDNN-LSTM based speaker embedding is intended to capture the temporal information in a speaker's speech better than conventional TDNN based x-vectors. The proposed methods were trained on the VoxCeleb dataset and tested on the Speakers In The Wild (SITW) dataset under the core-core evaluation condition. We achieved an equal error rate (EER) of 1.86%, a minimum decision cost function (DCF) of 0.204 at p-target=0.01, and a minimum DCF of 0.368 at p-target=0.001. The proposed methods outperform both the i-vector and x-vector baselines.
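The four augmentation types named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' pipeline: the abstract does not state the perturbation factors, impulse responses, noise sources, or signal-to-noise ratios used, so the function names, parameters, and the peak normalization of the impulse response below are all assumptions chosen for illustration.

```python
import numpy as np


def speed_perturb(x, factor):
    """Resample by linear interpolation; factor > 1 shortens (speeds up) x.

    Hypothetical helper: a real pipeline would likely use a proper resampler.
    """
    n_out = int(round(len(x) / factor))
    src = np.arange(n_out) * factor  # fractional read positions in x
    return np.interp(src, np.arange(len(x)), x)


def volume_perturb(x, gain):
    """Scale the waveform amplitude by a constant gain."""
    return gain * x


def add_reverb(x, rir):
    """Convolve x with a room impulse response, keeping the original length.

    The RIR is peak-normalized here (an assumption, not from the source).
    """
    rir = rir / np.max(np.abs(rir))
    return np.convolve(x, rir, mode="full")[: len(x)]


def add_noise(x, noise, snr_db):
    """Mix noise into x at the requested signal-to-noise ratio in dB.

    Assumes neither x nor noise is all zeros.
    """
    noise = np.resize(noise, len(x))  # repeat or truncate to match x
    p_signal = np.mean(x ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_signal / (p_noise * 10.0 ** (snr_db / 10.0)))
    return x + scale * noise


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    speech = rng.standard_normal(16000)      # stand-in for 1 s at 16 kHz
    rir = np.exp(-np.arange(800) / 100.0)    # toy decaying impulse response
    noise = rng.standard_normal(4000)        # toy noise clip

    augmented = [
        speed_perturb(speech, 1.1),
        volume_perturb(speech, 0.5),
        add_reverb(speech, rir),
        add_noise(speech, noise, snr_db=10.0),
    ]
    print([len(a) for a in augmented])
```

Each function returns a new waveform, so augmented copies can be added to the training set alongside the original recording.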
Date of Conference: 14-18 December 2019
Date Added to IEEE Xplore: 20 February 2020