ISCA Archive Interspeech 2022

Acoustic Feature Shuffling Network for Text-independent Speaker Verification

Jin Li, Xin Fang, Fan Chu, Tian Gao, Yan Song, Rong Li Dai

Deep embedding learning methods have shown state-of-the-art performance on text-independent speaker verification (SV) tasks compared to traditional i-vectors. Existing methods mainly focus on designing frame-level feature extraction structures, utterance-level aggregation methods, and loss functions to learn effective speaker embeddings. However, due to the locality of frame-level extraction, the resulting embeddings differ if the sequential order of the input utterance is shuffled. In contrast, conventional i-vector methods are order-insensitive. In this paper, we propose an acoustic feature shuffling network to learn order-insensitive speaker embeddings via a joint learning method. Specifically, the input utterance is first organized into multi-scale segments. Then, these segments are randomly shuffled to form the input of the deep embedding learning architecture. A symmetric Kullback-Leibler (KL) divergence loss, in addition to the Cross-Entropy (CE) loss, is used to force the learned architecture to be order-insensitive. Experimental results on the benchmark VoxCeleb corpus demonstrate the effectiveness of the proposed acoustic feature shuffling network.
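A minimal PyTorch sketch of the two ingredients the abstract describes: shuffling time segments of the input acoustic features, and a joint objective combining CE on both views with a symmetric KL term between their posteriors. All names and details here (shuffle_segments, joint_loss, the shared permutation, the segment length) are hypothetical illustrations of the idea, not the authors' released implementation.

    # Hypothetical sketch of the shuffling + joint-loss idea from the abstract.
    import torch
    import torch.nn.functional as F

    def shuffle_segments(feats: torch.Tensor, seg_len: int) -> torch.Tensor:
        """Split (batch, frames, dims) features into fixed-length time segments
        and return them in a random order (one permutation shared per batch)."""
        b, t, d = feats.shape
        t_trim = (t // seg_len) * seg_len          # drop frames that do not fill a segment
        segs = feats[:, :t_trim].reshape(b, t_trim // seg_len, seg_len, d)
        perm = torch.randperm(segs.size(1))        # random segment order
        return segs[:, perm].reshape(b, t_trim, d)

    def joint_loss(logits_orig, logits_shuf, labels, kl_weight=1.0):
        """CE on the original and shuffled views, plus a symmetric KL divergence
        pushing the two speaker posteriors to agree (order insensitivity)."""
        ce = F.cross_entropy(logits_orig, labels) + F.cross_entropy(logits_shuf, labels)
        p = F.log_softmax(logits_orig, dim=-1)
        q = F.log_softmax(logits_shuf, dim=-1)
        sym_kl = F.kl_div(q, p, log_target=True, reduction="batchmean") \
               + F.kl_div(p, q, log_target=True, reduction="batchmean")
        return ce + kl_weight * sym_kl

To reflect the multi-scale aspect, seg_len could be sampled from a set of scales per mini-batch; the symmetric KL term penalizes any discrepancy between the posteriors of the original and shuffled inputs, which is what drives the learned embeddings toward order insensitivity.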


doi: 10.21437/Interspeech.2022-10278

Cite as: Li, J., Fang, X., Chu, F., Gao, T., Song, Y., Dai, R.L. (2022) Acoustic Feature Shuffling Network for Text-independent Speaker Verification. Proc. Interspeech 2022, 4790-4794, doi: 10.21437/Interspeech.2022-10278

@inproceedings{li22r_interspeech,
  author={Jin Li and Xin Fang and Fan Chu and Tian Gao and Yan Song and Rong Li Dai},
  title={{Acoustic Feature Shuffling Network for Text-independent Speaker Verification}},
  year=2022,
  booktitle={Proc. Interspeech 2022},
  pages={4790--4794},
  doi={10.21437/Interspeech.2022-10278}
}