Flexi-Transducer: Optimizing Latency, Accuracy and Compute for Multi-Domain On-Device Scenarios

Mahadeokar, Jay; Shi, Yangyang; Shangguan, Yuan; Wu, Chunyang; Xiao, Alex; Su, Hang; Le, Duc; Kalinli, Ozlem; Fuegen, Christian; Seltzer, Michael L.

doi:10.21437/Interspeech.2021-1921

The storage and computational constraints of embedded devices often demand that a single on-device ASR model serve multiple use cases or domains. In this paper, we propose a Flexible Transducer (FlexiT) for on-device automatic speech recognition that handles multiple use cases or domains with different accuracy and latency requirements. Specifically, using a single compact model, FlexiT provides a fast response for voice commands and accurate transcription, at the cost of more latency, for dictation. To achieve flexible and better trade-offs between accuracy and latency, we use the following techniques. First, we vary the segment size of the Emformer encoder by domain, which enables FlexiT to decode flexibly. Second, we use the Alignment Restricted RNNT loss to gain fine-grained control over token emission latency for different domains. Finally, we add a domain indicator vector as an additional input to the FlexiT model. Combining these techniques, we show that a single model can improve WER and real-time factor for dictation scenarios while maintaining optimal latency for voice command use cases.
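As a rough illustration of the first and third techniques, the sketch below appends a one-hot domain indicator to each input frame and splits the frames into segments whose size depends on the domain. It is a minimal sketch under stated assumptions, not the authors' implementation: the names `DOMAIN_CONFIG`, `add_domain_indicator` and `split_into_segments`, the one-hot encoding, and the segment sizes are all hypothetical, and the abstract does not specify any of them.

```python
from typing import Dict, List

Frame = List[float]

# Hypothetical per-domain settings. The segment sizes are illustrative only;
# the abstract does not state the values used in the paper.
DOMAIN_CONFIG: Dict[str, Dict[str, int]] = {
    "voice_command": {"index": 0, "segment_frames": 4},   # small segments, low latency
    "dictation": {"index": 1, "segment_frames": 16},      # larger segments, more context
}


def one_hot(index: int, size: int) -> List[float]:
    """Return a one-hot vector of length `size` with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def add_domain_indicator(frames: List[Frame], domain: str) -> List[Frame]:
    """Append the domain's one-hot indicator to every feature frame."""
    indicator = one_hot(DOMAIN_CONFIG[domain]["index"], len(DOMAIN_CONFIG))
    return [list(frame) + indicator for frame in frames]


def split_into_segments(frames: List[Frame], domain: str) -> List[List[Frame]]:
    """Split frames into consecutive segments of the domain-specific size.

    The last segment may be shorter than the configured size.
    """
    size = DOMAIN_CONFIG[domain]["segment_frames"]
    return [frames[i:i + size] for i in range(0, len(frames), size)]


if __name__ == "__main__":
    # Ten dummy two-dimensional feature frames.
    features = [[0.1, 0.2] for _ in range(10)]
    for domain in DOMAIN_CONFIG:
        tagged = add_domain_indicator(features, domain)
        segments = split_into_segments(tagged, domain)
        print(domain, "segments:", [len(s) for s in segments])
```

In this sketch the same feature pipeline serves both domains, and only the indicator and the segment size change, which mirrors the idea of one compact model with domain-dependent decoding behaviour.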

Cite as: Mahadeokar, J., Shi, Y., Shangguan, Y., Wu, C., Xiao, A., Su, H., Le, D., Kalinli, O., Fuegen, C., Seltzer, M.L. (2021) Flexi-Transducer: Optimizing Latency, Accuracy and Compute for Multi-Domain On-Device Scenarios. Proc. Interspeech 2021, 2107-2111, doi: 10.21437/Interspeech.2021-1921

@inproceedings{mahadeokar21_interspeech,
  author={Jay Mahadeokar and Yangyang Shi and Yuan Shangguan and Chunyang Wu and Alex Xiao and Hang Su and Duc Le and Ozlem Kalinli and Christian Fuegen and Michael L. Seltzer},
  title={{Flexi-Transducer: Optimizing Latency, Accuracy and Compute for Multi-Domain On-Device Scenarios}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={2107--2111},
  doi={10.21437/Interspeech.2021-1921}
}