Exploring Effective Data Augmentation with TDNN-LSTM Neural Network Embedding for Speaker Recognition

Abstract:

This paper proposes a speaker characterization approach that combines four data augmentation methods with a speaker embedding network built from time delay neural network and long short-term memory layers (TDNN-LSTM). The data augmentation increases the amount and diversity of the training data through speed perturbation, volume perturbation, room impulse responses, and additive noise. The TDNN-LSTM based speaker embedding is intended to capture the temporal information in a speaker's speech better than conventional TDNN based x-vectors. The proposed methods were trained on the VoxCeleb dataset and evaluated on the Speakers In The Wild (SITW) dataset under the core-core evaluation condition. The system achieved an equal error rate (EER) of 1.86%, a minimum decision cost function (DCF) of 0.204 at p-target=0.01, and a minimum DCF of 0.368 at p-target=0.001. The proposed methods outperform both the i-vector and x-vector baselines.
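The four augmentation types named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' pipeline: the abstract does not state the perturbation factors, gain range, impulse responses, noise sources, or signal-to-noise ratios, so every function name and numeric value below is a hypothetical choice for illustration, and the signals are synthetic.

```python
import numpy as np

def speed_perturb(wave, factor):
    """Resample by linear interpolation so the audio plays `factor` times faster."""
    n_out = int(round(len(wave) / factor))
    src_pos = np.linspace(0, len(wave) - 1, n_out)
    return np.interp(src_pos, np.arange(len(wave)), wave)

def volume_perturb(wave, gain):
    """Scale the amplitude by a constant gain."""
    return wave * gain

def add_reverb(wave, rir):
    """Convolve with a room impulse response, keeping the original length."""
    return np.convolve(wave, rir)[: len(wave)]

def add_noise(wave, noise, snr_db):
    """Mix in noise, scaled so the signal-to-noise ratio equals snr_db."""
    noise = np.resize(noise, len(wave))  # repeat or truncate to match length
    sig_power = np.mean(wave ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(sig_power / (noise_power * 10 ** (snr_db / 10)))
    return wave + scale * noise

# Synthetic stand-ins for real recordings (assumed values, not from the paper).
rng = np.random.default_rng(0)
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
wave = np.sin(2 * np.pi * 220 * t)                   # 1 s test tone
rir = np.exp(-np.arange(800) / 100.0)                # toy decaying impulse response
noise = rng.standard_normal(sample_rate // 2)        # toy noise clip

faster = speed_perturb(wave, 1.1)                    # assumed speed factor
louder = volume_perturb(wave, 1.5)                   # assumed gain
reverberant = add_reverb(wave, rir)
noisy = add_noise(wave, noise, snr_db=10.0)          # assumed SNR
```

Each function returns a new copy of the utterance, so applying them to every training recording multiplies the amount of training data while varying tempo, loudness, room acoustics, and background conditions.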

Published in: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)

Date of Conference: 14-18 December 2019

Date Added to IEEE Xplore: 20 February 2020

DOI: 10.1109/ASRU46091.2019.9003938

Conference Location: Singapore