Mixup is a learning strategy that constructs additional virtual training
samples from existing training samples by linearly interpolating random
pairs of them. It has been shown that mixup can help avoid data memorization
and thus improve model generalization. This paper investigates the
mixup learning strategy for training a speaker-discriminative deep neural
network (DNN) for better text-independent speaker verification.
In recent speaker verification systems, a DNN is usually trained
to classify the speakers in the training set. At the same time, the DNN
learns a low-dimensional embedding of speakers, so that speaker embeddings
can be generated for any speaker during evaluation. We adapted the
mixup strategy to the speaker-discriminative DNN training procedure,
and studied different mixup schemes, such as performing mixup on MFCC
features or raw audio samples. The mixup learning strategy was evaluated
on the NIST SRE 2010, NIST SRE 2016, and SITW evaluation sets. Experimental
results show consistent performance improvements of up to 13% relative in
terms of both equal error rate (EER) and detection cost function
(DCF). We further find that mixup training also improves
the DNN’s speaker classification accuracy consistently without
requiring any additional data sources.
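
The core mixup operation (in Zhang et al.'s original formulation) builds each virtual example as x~ = λ·x_i + (1−λ)·x_j and y~ = λ·y_i + (1−λ)·y_j, with λ drawn from a Beta(α, α) distribution. Below is a minimal sketch of feature-level mixup in NumPy; the function name, the value of α, and the batch shapes are illustrative assumptions, not the paper's exact configuration.

    import numpy as np

    def mixup_batch(features, labels, alpha=0.2, rng=None):
        """Mix each example with a randomly chosen partner from the same batch.

        features: array of shape (batch, frames, coeffs), e.g. MFCC chunks.
        labels:   one-hot speaker labels of shape (batch, num_speakers).
        alpha:    Beta-distribution parameter controlling interpolation strength.
        """
        rng = rng or np.random.default_rng()
        lam = rng.beta(alpha, alpha)            # mixing coefficient in (0, 1)
        perm = rng.permutation(len(features))   # random pairing within the batch
        mixed_x = lam * features + (1.0 - lam) * features[perm]
        mixed_y = lam * labels + (1.0 - lam) * labels[perm]
        return mixed_x, mixed_y

    # Example: 32 chunks of 200 frames x 40-dim MFCCs, 1000 training speakers.
    x = np.random.randn(32, 200, 40).astype(np.float32)
    y = np.eye(1000, dtype=np.float32)[np.random.randint(0, 1000, size=32)]
    mixed_x, mixed_y = mixup_batch(x, y)

The same interpolation can instead be applied to raw audio samples before feature extraction, which is one of the mixup schemes the paper compares.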
Cite as: Zhu, Y., Ko, T., Mak, B. (2019) Mixup Learning Strategies for Text-Independent Speaker Verification. Proc. Interspeech 2019, 4345-4349, doi: 10.21437/Interspeech.2019-2250
@inproceedings{zhu19b_interspeech,
  author={Yingke Zhu and Tom Ko and Brian Mak},
  title={{Mixup Learning Strategies for Text-Independent Speaker Verification}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={4345--4349},
  doi={10.21437/Interspeech.2019-2250}
}