Streaming automatic speech recognition (ASR) hypothesizes words as soon as the input audio arrives, whereas non-streaming ASR can potentially wait for the completion of the entire utterance to hypothesize words. Streaming and non-streaming ASR systems have typically used different acoustic encoders. Recent work has attempted to unify them by either jointly training a fixed stack of streaming and non-streaming layers or using knowledge distillation during training to ensure consistency between the streaming and non-streaming predictions. We propose mixture model (MiMo) attention as a simpler and theoretically motivated alternative that replaces only the attention mechanism, requires no change to the training loss, and allows greater flexibility in switching between streaming and non-streaming modes during inference. Our experiments on the public Librispeech data set and a few Indic language data sets show that MiMo attention endows a single ASR model with the ability to operate in both streaming and non-streaming modes without any overhead and without significant loss in accuracy compared to separately trained streaming and non-streaming models. We also illustrate this benefit of MiMo attention in a second-pass rescoring setting.
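A minimal NumPy sketch of the idea of mixing streaming and non-streaming attention. This is an illustration under assumed details, not the paper's exact formulation: the mixture weight `lam` and the function `mimo_style_attention` are hypothetical names, and the sketch simply takes a convex combination of a causal-masked (streaming) attention distribution and a full-context (non-streaming) one, so a single set of weights can serve both modes at inference.

```python
# Illustrative sketch only: convex mixture of streaming (causal-masked) and
# non-streaming (full-context) self-attention. `lam` is a hypothetical knob;
# lam=1.0 yields purely streaming attention, lam=0.0 purely non-streaming.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mimo_style_attention(q, k, v, lam):
    """Mix causal and full attention distributions with weight `lam`."""
    t, d = q.shape
    scores = q @ k.T / np.sqrt(d)                # (t, t) similarity scores
    full = softmax(scores, axis=-1)              # non-streaming: all frames visible
    causal_scores = np.where(np.tril(np.ones((t, t), dtype=bool)),
                             scores, -np.inf)    # mask out future frames
    causal = softmax(causal_scores, axis=-1)     # streaming: past frames only
    probs = lam * causal + (1.0 - lam) * full    # convex mixture of the two
    return probs @ v

rng = np.random.default_rng(0)
q = k = v = rng.standard_normal((4, 8))
stream = mimo_style_attention(q, k, v, lam=1.0)      # streaming mode
non_stream = mimo_style_attention(q, k, v, lam=0.0)  # non-streaming mode
```

Because the first frame of the causal distribution can only attend to itself, the streaming output at position 0 equals `v[0]`; switching modes is just a change of `lam`, with no architectural change.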
Cite as: Audhkhasi, K., Chen, T., Ramabhadran, B., Moreno, P.J. (2021) Mixture Model Attention: Flexible Streaming and Non-Streaming Automatic Speech Recognition. Proc. Interspeech 2021, 1812-1816, doi: 10.21437/Interspeech.2021-720
@inproceedings{audhkhasi21_interspeech,
  author={Kartik Audhkhasi and Tongzhou Chen and Bhuvana Ramabhadran and Pedro J. Moreno},
  title={{Mixture Model Attention: Flexible Streaming and Non-Streaming Automatic Speech Recognition}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={1812--1816},
  doi={10.21437/Interspeech.2021-720}
}