An Efficient Streaming Non-Recurrent On-Device End-to-End Model with Improvements to Rare-Word Modeling

Sainath, Tara N.; He, Yanzhang; Narayanan, Arun; Botros, Rami; Pang, Ruoming; Rybach, David; Allauzen, Cyril; Variani, Ehsan; Qin, James; Le-The, Quoc-Nam; Chang, Shuo-Yiin; Li, Bo; Gulati, Anmol; Yu, Jiahui; Chiu, Chung-Cheng; Caseiro, Diamantino; Li, Wei; Liang, Qiao; Rondon, Pat

doi:10.21437/Interspeech.2021-206

On-device end-to-end (E2E) models have shown improvements over a conventional model on Search test sets in both quality, as measured by Word Error Rate (WER) [1], and latency [2], as measured by the time the result is finalized after the user stops speaking. However, the E2E model is trained on audio-text pairs that amount to only a small fraction of the 100 billion text utterances used to train a conventional language model (LM). As a result, E2E models perform poorly on rare words and phrases. In this paper, building upon the two-pass streaming Cascaded Encoder E2E model [3], we explore using a Hybrid Autoregressive Transducer (HAT) [4] factorization to better integrate an on-device neural LM trained on text-only data. In addition, to further improve decoder latency, we introduce a non-recurrent embedding decoder, in place of the typical LSTM decoder, into the Cascaded Encoder model. Overall, we present a streaming on-device model that incorporates an external neural LM, outperforms the conventional model in search quality, rare-word quality, and latency, and is 318× smaller.
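The abstract does not spell out how the HAT factorization is used to integrate the external LM. As a rough illustration only, the sketch below follows the formulation commonly associated with HAT [4]: the blank symbol is modeled with its own sigmoid, labels share the remaining probability mass through a softmax, and at inference the E2E score is combined with an internal-LM estimate and the external LM score. The function names, argument names, and weight values here are hypothetical and are not taken from the paper.

```python
import math


def hat_log_probs(blank_logit, label_logits):
    """Split one joint-network output into blank and label log-probabilities.

    Assumption: blank is a Bernoulli (sigmoid) and labels are a softmax
    scaled by (1 - P(blank)), as in the usual description of HAT.
    """
    log_blank = -math.log1p(math.exp(-blank_logit))      # log sigmoid(x)
    log_not_blank = -math.log1p(math.exp(blank_logit))   # log (1 - sigmoid(x))
    m = max(label_logits)                                # for a stable log-sum-exp
    log_z = m + math.log(sum(math.exp(v - m) for v in label_logits))
    log_labels = [log_not_blank + v - log_z for v in label_logits]
    return log_blank, log_labels


def fused_score(log_p_e2e, log_p_ilm, log_p_lm, ilm_weight=0.1, lm_weight=0.1):
    """Combine hypothesis scores in the log domain.

    Subtracts a weighted internal-LM estimate and adds a weighted external-LM
    score. The default weights are placeholders, not values from the paper.
    """
    return log_p_e2e - ilm_weight * log_p_ilm + lm_weight * log_p_lm
```

In a beam search, `fused_score` would be applied per hypothesis, so a rare word that the text-only LM finds plausible can gain score relative to the E2E model's own implicit language prior.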

Cite as: Sainath, T.N., He, Y., Narayanan, A., Botros, R., Pang, R., Rybach, D., Allauzen, C., Variani, E., Qin, J., Le-The, Q.-N., Chang, S.-Y., Li, B., Gulati, A., Yu, J., Chiu, C.-C., Caseiro, D., Li, W., Liang, Q., Rondon, P. (2021) An Efficient Streaming Non-Recurrent On-Device End-to-End Model with Improvements to Rare-Word Modeling. Proc. Interspeech 2021, 1777-1781, doi: 10.21437/Interspeech.2021-206

@inproceedings{sainath21_interspeech,
  author={Tara N. Sainath and Yanzhang He and Arun Narayanan and Rami Botros and Ruoming Pang and David Rybach and Cyril Allauzen and Ehsan Variani and James Qin and Quoc-Nam Le-The and Shuo-Yiin Chang and Bo Li and Anmol Gulati and Jiahui Yu and Chung-Cheng Chiu and Diamantino Caseiro and Wei Li and Qiao Liang and Pat Rondon},
  title={{An Efficient Streaming Non-Recurrent On-Device End-to-End Model with Improvements to Rare-Word Modeling}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={1777--1781},
  doi={10.21437/Interspeech.2021-206}
}