In this paper, we propose a novel time-domain speaker-speech cross-attention network as a variant of the SpEx [1] architecture. The network consists of speech semantic layers that capture high-level dependencies of the audio features, and cross-attention layers that fuse the speaker embedding with the speech features to estimate the speaker mask. We implement the cross-attention layers with both parallel and sequential concatenation techniques. Experiments show that the proposed models consistently outperform the state-of-the-art time-domain speaker extraction baseline on the WSJ0-2mix dataset.
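The cross-attention fusion described above can be sketched as follows. This is a minimal illustrative sketch using standard scaled dot-product attention, where speech frames query the target-speaker embedding; the dimensions, variable names, and the dot-product form are assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, key, value):
    # Scaled dot-product attention: `query` attends over `key`/`value`.
    scores = query @ key.T / np.sqrt(query.shape[-1])   # (Tq, Tk)
    weights = softmax(scores, axis=-1)                  # each row sums to 1
    return weights @ value, weights

rng = np.random.default_rng(0)
T, d = 50, 64                               # 50 speech frames, 64-dim features (illustrative)
speech = rng.standard_normal((T, d))        # output of the speech semantic layers (assumed)
speaker = rng.standard_normal((1, d))       # target-speaker embedding (assumed)

# Each speech frame queries the speaker embedding; the fused, speaker-aware
# features would then feed the mask-estimation layers.
fused, weights = cross_attention(speech, speaker, speaker)
```

In a real system the query, key, and value would pass through learned linear projections, and the parallel vs. sequential concatenation variants would differ in how the fused output is combined with the original speech features.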
Cite as: Wang, W., Xu, C., Ge, M., Li, H. (2021) Neural Speaker Extraction with Speaker-Speech Cross-Attention Network. Proc. Interspeech 2021, 3535-3539, doi: 10.21437/Interspeech.2021-2260
@inproceedings{wang21aa_interspeech,
  author={Wupeng Wang and Chenglin Xu and Meng Ge and Haizhou Li},
  title={{Neural Speaker Extraction with Speaker-Speech Cross-Attention Network}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={3535--3539},
  doi={10.21437/Interspeech.2021-2260}
}