ISCA Archive Interspeech 2022

Attentive Feature Fusion for Robust Speaker Verification

Bei Liu, Zhengyang Chen, Yanmin Qian

Deep speaker embedding learning has recently become the predominant technique for the speaker verification task. This approach utilizes deep neural networks to extract fixed-dimensional embedding vectors that represent different speaker identities. Two network architectures, ResNet and ECAPA-TDNN, have been commonly adopted in prior studies and achieve state-of-the-art performance. One ubiquitous component, feature fusion, plays an important role in both of them. For example, shortcut connections in ResNet fuse the identity mapping of a block's input with the output of its residual branch, while ECAPA-TDNN employs multi-layer feature aggregation to integrate shallow feature maps with deep ones. Traditional feature fusion is often implemented via simple operations such as element-wise addition or concatenation. In this paper, we propose a more effective feature fusion scheme, namely Attentive Feature Fusion (AFF), to render a dynamic weighted fusion of different features. It utilizes attention modules to learn fusion weights based on the feature contents. Additionally, two fusion strategies are designed: sequential fusion and parallel fusion. Experiments on the VoxCeleb dataset show that our proposed attentive feature fusion scheme can yield up to a 40% relative improvement over the baseline systems.
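To illustrate the general idea of replacing element-wise addition with content-dependent weighted fusion, the following is a minimal PyTorch sketch. It is not the authors' exact module: the class name AttentiveFusion, the bottleneck reduction factor, and the sigmoid-gated convex combination are all illustrative assumptions; the paper's attention modules and its sequential/parallel fusion strategies may differ in detail.

    import torch
    import torch.nn as nn

    class AttentiveFusion(nn.Module):
        """Fuse two same-shaped feature maps with learned, content-dependent weights.

        Hypothetical sketch: a channel bottleneck produces a sigmoid gate w from
        the combined features, and the output is w * x + (1 - w) * y, replacing
        plain element-wise addition (e.g. a ResNet shortcut sum).
        """

        def __init__(self, channels: int, reduction: int = 4):
            super().__init__()
            hidden = max(channels // reduction, 1)
            self.attn = nn.Sequential(
                nn.Conv2d(channels, hidden, kernel_size=1),
                nn.BatchNorm2d(hidden),
                nn.ReLU(inplace=True),
                nn.Conv2d(hidden, channels, kernel_size=1),
                nn.Sigmoid(),
            )

        def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
            w = self.attn(x + y)           # fusion weights learned from feature content
            return w * x + (1.0 - w) * y   # dynamic weighted fusion of the two inputs

Under this sketch, a ResNet-style block would compute fused = AttentiveFusion(channels)(out, identity) instead of out + identity; the same gating idea could be applied once to the summed inputs (one plausible reading of "sequential" fusion) or with separate attention branches per input ("parallel" fusion), though the abstract does not specify these details.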


doi: 10.21437/Interspeech.2022-478

Cite as: Liu, B., Chen, Z., Qian, Y. (2022) Attentive Feature Fusion for Robust Speaker Verification. Proc. Interspeech 2022, 286-290, doi: 10.21437/Interspeech.2022-478

@inproceedings{liu22f_interspeech,
  author={Bei Liu and Zhengyang Chen and Yanmin Qian},
  title={{Attentive Feature Fusion for Robust Speaker Verification}},
  year=2022,
  booktitle={Proc. Interspeech 2022},
  pages={286--290},
  doi={10.21437/Interspeech.2022-478}
}