ISCA Archive Interspeech 2021

Improving Deep CNN Architectures with Variable-Length Training Samples for Text-Independent Speaker Verification

Yanfeng Wu, Junan Zhao, Chenkai Guo, Jing Xu

Deep Convolutional Neural Network (CNN) based speaker embeddings, such as r-vectors, have shown great success in the text-independent speaker verification (TI-SV) task. However, previous deep CNN models usually use fixed-length samples for training but employ variable-length utterances when extracting speaker embeddings, which creates a mismatch between training and embedding. To address this issue, we investigate the effect of employing variable-length training samples in CNN-based TI-SV systems and explore two approaches to improve the performance of deep CNN architectures on TI-SV by capturing variable-term contexts. First, we present an improved selective kernel convolution that allows the networks to adaptively switch between short-term and long-term contexts based on variable-length utterances. Second, we propose a multi-scale statistics pooling method to aggregate features at multiple time scales from different layers of the networks. We build a novel ResNet34-based architecture incorporating the two proposed approaches. Experiments are conducted on the VoxCeleb datasets. The results demonstrate that the effect of using variable-length samples differs across networks, and that the architecture with the two proposed approaches achieves a significant improvement over the r-vector baseline system.
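As a rough illustration (not the authors' implementation), the PyTorch sketch below shows how a multi-scale statistics pooling layer of the kind described in the abstract could aggregate mean and standard-deviation statistics over the time axis from feature maps taken at several network stages; the layer shapes, tensor sizes, and function names are assumptions made only for this example.

import torch

def stats_pool(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # x: (batch, channels, time) -> (batch, 2 * channels)
    # Mean and standard deviation are computed over the time axis,
    # so the output dimension is independent of the utterance length.
    mean = x.mean(dim=-1)
    std = x.var(dim=-1, unbiased=False).clamp(min=eps).sqrt()
    return torch.cat([mean, std], dim=-1)

def multi_scale_stats_pool(feature_maps):
    # feature_maps: list of tensors shaped (batch, channels_i, time_i),
    # e.g. outputs of different ResNet stages with different time resolutions.
    # Each map is pooled separately and the statistics are concatenated.
    return torch.cat([stats_pool(f) for f in feature_maps], dim=-1)

# Dummy feature maps from three hypothetical ResNet34 stages:
feats = [torch.randn(2, 128, 300), torch.randn(2, 256, 150), torch.randn(2, 512, 75)]
pooled = multi_scale_stats_pool(feats)  # shape: (2, 2 * (128 + 256 + 512))

Because the statistics are computed over the time axis, the pooled vector has a fixed dimension regardless of how long the input utterance is, which is what makes variable-length utterances usable at embedding time.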


doi: 10.21437/Interspeech.2021-559

Cite as: Wu, Y., Zhao, J., Guo, C., Xu, J. (2021) Improving Deep CNN Architectures with Variable-Length Training Samples for Text-Independent Speaker Verification. Proc. Interspeech 2021, 81-85, doi: 10.21437/Interspeech.2021-559

@inproceedings{wu21_interspeech,
  author={Yanfeng Wu and Junan Zhao and Chenkai Guo and Jing Xu},
  title={{Improving Deep CNN Architectures with Variable-Length Training Samples for Text-Independent Speaker Verification}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={81--85},
  doi={10.21437/Interspeech.2021-559}
}