Speech captured by microphones is vulnerable in complex acoustic environments due to noise and reverberation, whereas cameras are not. Utilizing the visual modality in the multi-talker "cocktail party" scenario has therefore become a promising and popular approach. In this paper, we explore the incorporation of the visual modality into the end-to-end multi-talker speech recognition task. We propose two methods distinguished by the modality fusion position: encoder-based fusion and decoder-based fusion. For each method, advanced audio-visual fusion techniques, including an attention mechanism and a dual decoder, are explored to find the best usage of the visual modality. With the proposed methods, our best audio-visual multi-talker automatic speech recognition (ASR) model achieves approximately 50.0% relative word error rate (WER) reduction compared to the audio-only multi-talker ASR system.
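The attention-based fusion mentioned above can be illustrated with a generic cross-modal attention sketch: each audio frame attends over the visual frames, and the attended visual context is concatenated with the audio representation. This is a minimal NumPy illustration under assumed shapes and a hypothetical function name, not the authors' exact architecture.

```python
import numpy as np

def cross_modal_attention_fusion(audio, visual):
    """Illustrative audio-visual fusion via scaled dot-product attention.

    audio:  (T_a, D) audio encoder outputs (queries)
    visual: (T_v, D) visual encoder outputs (keys/values)
    Returns (T_a, 2*D): audio frames concatenated with visual context.
    (Shapes and fusion-by-concatenation are assumptions for this sketch.)
    """
    d_k = audio.shape[-1]
    # attention scores between every audio (query) and visual (key) frame
    scores = audio @ visual.T / np.sqrt(d_k)            # (T_a, T_v)
    # numerically stable softmax over the visual axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    context = weights @ visual                          # (T_a, D)
    return np.concatenate([audio, context], axis=-1)    # (T_a, 2*D)

# toy example: 5 audio frames, 3 visual frames, 8-dim features
rng = np.random.default_rng(0)
fused = cross_modal_attention_fusion(rng.normal(size=(5, 8)),
                                     rng.normal(size=(3, 8)))
print(fused.shape)  # (5, 16)
```

In encoder-based fusion such a module would sit inside the encoder stack; in decoder-based fusion the decoder would attend to the visual stream instead.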
Cite as: Wu, Y., Li, C., Yang, S., Wu, Z., Qian, Y. (2021) Audio-Visual Multi-Talker Speech Recognition in a Cocktail Party. Proc. Interspeech 2021, 3021-3025, doi: 10.21437/Interspeech.2021-2128
@inproceedings{wu21e_interspeech,
  author={Yifei Wu and Chenda Li and Song Yang and Zhongqin Wu and Yanmin Qian},
  title={{Audio-Visual Multi-Talker Speech Recognition in a Cocktail Party}},
  year={2021},
  booktitle={Proc. Interspeech 2021},
  pages={3021--3025},
  doi={10.21437/Interspeech.2021-2128}
}