In this paper, we propose a novel time-domain speaker-speech cross-attention network as a variant of the SpEx [1] architecture. The network consists of speech semantic layers that capture high-level dependencies of the audio features, and cross-attention layers that fuse the speaker embedding with the speech features to estimate the speaker mask. We implement the cross-attention layers with both parallel and sequential concatenation techniques. Experiments show that the proposed models consistently outperform the state-of-the-art time-domain speaker extraction baseline on the WSJ0-2mix dataset.
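The cross-attention fusion described above can be sketched as follows. This is a minimal illustrative sketch using standard scaled dot-product attention, where speech frames query the target-speaker embedding; the dimensions, variable names, and the dot-product form are assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, key, value):
    # Scaled dot-product attention: `query` attends over `key`/`value`.
    scores = query @ key.T / np.sqrt(query.shape[-1])   # (Tq, Tk)
    weights = softmax(scores, axis=-1)                  # each row sums to 1
    return weights @ value, weights

rng = np.random.default_rng(0)
T, d = 50, 64                               # 50 speech frames, 64-dim features (illustrative)
speech = rng.standard_normal((T, d))        # output of the speech semantic layers (assumed)
speaker = rng.standard_normal((1, d))       # target-speaker embedding (assumed)

# Each speech frame queries the speaker embedding; the fused, speaker-aware
# features would then feed the mask-estimation layers.
fused, weights = cross_attention(speech, speaker, speaker)
```

In a real system the query, key, and value would pass through learned linear projections, and the parallel vs. sequential concatenation variants would differ in how the fused output is combined with the original speech features.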
Cite as: Wang, W., Xu, C., Ge, M., Li, H. (2021) Neural Speaker Extraction with Speaker-Speech Cross-Attention Network. Proc. Interspeech 2021, 3535-3539, doi: 10.21437/Interspeech.2021-2260
@inproceedings{wang21aa_interspeech,
  author={Wupeng Wang and Chenglin Xu and Meng Ge and Haizhou Li},
  title={{Neural Speaker Extraction with Speaker-Speech Cross-Attention Network}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={3535--3539},
  doi={10.21437/Interspeech.2021-2260}
}