Speech captured by microphones is vulnerable in complex acoustic environments due to noise and reverberation, whereas cameras are not. Utilizing the visual modality in the multi-talker "cocktail party" scenario has therefore become a promising and popular approach. In this paper, we explore the incorporation of the visual modality into the end-to-end multi-talker speech recognition task. We propose two methods distinguished by the modality fusion position: encoder-based fusion and decoder-based fusion. For each method, advanced audio-visual fusion techniques, including an attention mechanism and a dual decoder, are explored to find the best usage of the visual modality. With the proposed methods, our best audio-visual multi-talker automatic speech recognition (ASR) model achieves approximately 50.0% relative word error rate (WER) reduction compared to the audio-only multi-talker ASR system.
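The attention-based fusion mentioned above can be illustrated with a generic cross-modal attention sketch: each audio frame attends over the visual frames, and the attended visual context is concatenated with the audio representation. This is a minimal NumPy illustration under assumed shapes and a hypothetical function name, not the authors' exact architecture.

```python
import numpy as np

def cross_modal_attention_fusion(audio, visual):
    """Illustrative audio-visual fusion via scaled dot-product attention.

    audio:  (T_a, D) audio encoder outputs (queries)
    visual: (T_v, D) visual encoder outputs (keys/values)
    Returns (T_a, 2*D): audio frames concatenated with visual context.
    (Shapes and fusion-by-concatenation are assumptions for this sketch.)
    """
    d_k = audio.shape[-1]
    # attention scores between every audio (query) and visual (key) frame
    scores = audio @ visual.T / np.sqrt(d_k)            # (T_a, T_v)
    # numerically stable softmax over the visual axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    context = weights @ visual                          # (T_a, D)
    return np.concatenate([audio, context], axis=-1)    # (T_a, 2*D)

# toy example: 5 audio frames, 3 visual frames, 8-dim features
rng = np.random.default_rng(0)
fused = cross_modal_attention_fusion(rng.normal(size=(5, 8)),
                                     rng.normal(size=(3, 8)))
print(fused.shape)  # (5, 16)
```

In encoder-based fusion such a module would sit inside the encoder stack; in decoder-based fusion the decoder would attend to the visual stream instead.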
Cite as: Wu, Y., Li, C., Yang, S., Wu, Z., Qian, Y. (2021) Audio-Visual Multi-Talker Speech Recognition in a Cocktail Party. Proc. Interspeech 2021, 3021-3025, doi: 10.21437/Interspeech.2021-2128
@inproceedings{wu21e_interspeech,
  author={Yifei Wu and Chenda Li and Song Yang and Zhongqin Wu and Yanmin Qian},
  title={{Audio-Visual Multi-Talker Speech Recognition in a Cocktail Party}},
  year={2021},
  booktitle={Proc. Interspeech 2021},
  pages={3021--3025},
  doi={10.21437/Interspeech.2021-2128}
}