Often, the storage and computational constraints of embedded devices demand that a single on-device ASR model serve multiple use-cases / domains. In this paper, we propose a Flexible Transducer (FlexiT) for on-device automatic speech recognition to flexibly deal with multiple use-cases / domains with different accuracy and latency requirements. Specifically, using a single compact model, FlexiT provides a fast response for voice commands and accurate transcription, at the cost of more latency, for dictation. To achieve flexible and better accuracy and latency trade-offs, we use the following techniques. Firstly, we propose domain-specific altering of the segment size for the Emformer encoder, which enables FlexiT to achieve flexible decoding. Secondly, we use the Alignment Restricted RNNT loss to achieve flexible, fine-grained control over token emission latency for different domains. Finally, we add a domain indicator vector as an additional input to the FlexiT model. Using the combination of these techniques, we show that a single model can be used to improve WERs and real time factor for dictation scenarios while maintaining optimal latency for voice commands use-cases.
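As a rough illustration of the first and third techniques, the sketch below appends a one-hot domain indicator to every input frame and splits the utterance into segments whose size depends on the domain. It is a minimal sketch under stated assumptions, not the paper's implementation: the names `DomainConfig`, `DOMAINS`, and `prepare_segments`, the segment sizes (4 and 16 frames), and the choice to concatenate the indicator to the features are all hypothetical, and the Emformer encoder and the Alignment Restricted RNNT loss are not modeled here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class DomainConfig:
    """Hypothetical per-domain settings; the values are illustrative only."""
    domain_id: int
    segment_frames: int  # encoder segment size, in frames


# Assumed values: a small segment for low-latency voice commands,
# a larger segment for dictation where more latency is acceptable.
DOMAINS: Dict[str, DomainConfig] = {
    "voice_command": DomainConfig(domain_id=0, segment_frames=4),
    "dictation": DomainConfig(domain_id=1, segment_frames=16),
}


def one_hot(index: int, size: int) -> List[float]:
    """Return a one-hot vector of length `size` with a 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def prepare_segments(
    frames: List[List[float]], domain: str
) -> List[List[List[float]]]:
    """Tag each frame with the domain indicator, then split into segments."""
    cfg = DOMAINS[domain]
    indicator = one_hot(cfg.domain_id, len(DOMAINS))
    # Concatenate the indicator to every frame (one possible way to feed it in).
    tagged = [frame + indicator for frame in frames]
    step = cfg.segment_frames
    return [tagged[i:i + step] for i in range(0, len(tagged), step)]


# Toy utterance: 32 frames of 2-dimensional features.
frames = [[0.1 * t, 0.2 * t] for t in range(32)]
cmd_segments = prepare_segments(frames, "voice_command")
dict_segments = prepare_segments(frames, "dictation")
print(len(cmd_segments), len(dict_segments))
```

With these assumed sizes, the same 32-frame utterance yields more, shorter segments for voice commands, so the encoder could emit output sooner, and fewer, longer segments for dictation, which gives each step more context and fewer encoder invocations.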
Cite as: Mahadeokar, J., Shi, Y., Shangguan, Y., Wu, C., Xiao, A., Su, H., Le, D., Kalinli, O., Fuegen, C., Seltzer, M.L. (2021) Flexi-Transducer: Optimizing Latency, Accuracy and Compute for Multi-Domain On-Device Scenarios. Proc. Interspeech 2021, 2107-2111, doi: 10.21437/Interspeech.2021-1921
@inproceedings{mahadeokar21_interspeech,
  author={Jay Mahadeokar and Yangyang Shi and Yuan Shangguan and Chunyang Wu and Alex Xiao and Hang Su and Duc Le and Ozlem Kalinli and Christian Fuegen and Michael L. Seltzer},
  title={{Flexi-Transducer: Optimizing Latency, Accuracy and Compute for Multi-Domain On-Device Scenarios}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={2107--2111},
  doi={10.21437/Interspeech.2021-1921}
}