Joint Feature Enhancement and Speaker Recognition with Multi-Objective Task-Oriented Network

Wu, Yibo; Wang, Longbiao; Lee, Kong Aik; Liu, Meng; Dang, Jianwu

doi:10.21437/Interspeech.2021-1978

Recently, increasing attention has been paid to the joint training of upstream and downstream tasks, and to the challenge of synchronizing multiple loss functions in a multi-objective scenario. In this paper, to address the competing gradient directions of the speaker classification loss and the feature enhancement loss, we propose an asynchronous subregion optimization approach for the joint training of feature enhancement and speaker embedding neural networks. For the asynchronous subregion optimization, the squeeze-and-excitation (SE) method is introduced in the enhancement network to adaptively select the channels that are important for speaker embedding. Furthermore, channel-wise feature concatenation is applied between the input feature and the enhanced feature to counter the distortion of speaker information caused by the enhancement loss. Using the proposed joint training network with asynchronous subregion optimization and channel-wise feature concatenation, we obtained relative improvements of 11.95% and 6.43% in equal error rate on a noisy version of VoxCeleb1 and on the VOiCES corpus, respectively.
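
The two building blocks named in the abstract, SE channel gating and channel-wise concatenation, can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names `squeeze_excite` and `concat_channels`, the weight shapes, the bottleneck ratio, and the toy feature sizes are all assumptions made for illustration, and the asynchronous subregion optimization itself is not shown.

```python
import math
import random


def squeeze_excite(feature, w1, w2):
    """Apply a generic squeeze-and-excitation gate to a feature map.

    feature: list of C channels, each a T x F list of lists (hypothetical layout).
    w1: (C // r) x C bottleneck weights; w2: C x (C // r) expansion weights.
    Returns the re-weighted feature map and the per-channel gates.
    """
    # Squeeze: global average over time and frequency for each channel.
    squeezed = [
        sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature
    ]
    # Excitation, step 1: bottleneck fully connected layer followed by ReLU.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    # Excitation, step 2: expansion layer followed by a sigmoid, one gate per channel.
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [[[g * v for v in row] for row in ch] for g, ch in zip(gates, feature)]
    return scaled, gates


def concat_channels(input_feature, enhanced_feature):
    """Stack the input and enhanced features along the channel axis."""
    return input_feature + enhanced_feature


# Toy usage with assumed sizes: C=4 channels, T=3 frames, F=5 bins, ratio r=2.
random.seed(0)
C, T, F, r = 4, 3, 5, 2
feature_map = [[[random.random() for _ in range(F)] for _ in range(T)] for _ in range(C)]
w1 = [[random.uniform(-1, 1) for _ in range(C)] for _ in range(C // r)]
w2 = [[random.uniform(-1, 1) for _ in range(C // r)] for _ in range(C)]

enhanced, gates = squeeze_excite(feature_map, w1, w2)
stacked = concat_channels(feature_map, enhanced)
print("number of gates:", len(gates))
print("channels after concatenation:", len(stacked))
```

In this sketch the concatenated tensor keeps an unmodified copy of the input next to the gated output, which is the intuition behind preserving speaker information that enhancement might otherwise distort. How the paper wires these pieces into its networks is not specified in the abstract.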

Cite as: Wu, Y., Wang, L., Lee, K.A., Liu, M., Dang, J. (2021) Joint Feature Enhancement and Speaker Recognition with Multi-Objective Task-Oriented Network. Proc. Interspeech 2021, 1089-1093, doi: 10.21437/Interspeech.2021-1978

@inproceedings{wu21c_interspeech,
  author={Yibo Wu and Longbiao Wang and Kong Aik Lee and Meng Liu and Jianwu Dang},
  title={{Joint Feature Enhancement and Speaker Recognition with Multi-Objective Task-Oriented Network}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={1089--1093},
  doi={10.21437/Interspeech.2021-1978}
}