Real-Time End-to-End Monaural Multi-Speaker Speech Recognition

Li, Song; Ouyang, Beibei; Tong, Fuchuan; Liao, Dexin; Li, Lin; Hong, Qingyang

doi:10.21437/Interspeech.2021-1449

The rising interest in single-channel multi-speaker speech separation has triggered the development of end-to-end multi-speaker automatic speech recognition (ASR). However, most systems to date decode autoregressively, which makes decoding slow and hinders the deployment of multi-speaker speech recognition in real-world environments. In this paper, we first comprehensively investigate and compare the mainstream end-to-end multi-speaker speech recognition systems. Second, we improve the recently proposed non-autoregressive end-to-end speech recognition model Mask-CTC and introduce it to multi-speaker speech recognition to achieve real-time decoding. Our experiments on the LibriMix dataset show that, with the same number of parameters, the non-autoregressive model achieves performance close to that of the autoregressive model while decoding faster.

Cite as: Li, S., Ouyang, B., Tong, F., Liao, D., Li, L., Hong, Q. (2021) Real-Time End-to-End Monaural Multi-Speaker Speech Recognition. Proc. Interspeech 2021, 3750-3754, doi: 10.21437/Interspeech.2021-1449

@inproceedings{li21l_interspeech,
  author={Song Li and Beibei Ouyang and Fuchuan Tong and Dexin Liao and Lin Li and Qingyang Hong},
  title={{Real-Time End-to-End Monaural Multi-Speaker Speech Recognition}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={3750--3754},
  doi={10.21437/Interspeech.2021-1449}
}