ISCA Archive Interspeech 2022

Towards Efficiently Learning Monotonic Alignments for Attention-based End-to-End Speech Recognition

Chenfeng Miao, Kun Zou, Ziyang Zhuang, Tao Wei, Jun Ma, Shaojun Wang, Jing Xiao

Inspired by EfficientTTS, a recently proposed speech synthesis model, we propose a new way to train end-to-end speech recognition models with an additional training objective, allowing the models to learn monotonic alignments effectively and efficiently. The introduced training objective is differentiable, computationally cheap and, most importantly, imposes no constraints on network structures. It is therefore straightforward to incorporate into any speech recognition model. Through extensive experiments, we observed that our models significantly outperform baseline models. Specifically, our best performing model achieves a WER (Word Error Rate) of 3.18% on the LibriSpeech test-clean benchmark and 8.41% on test-other. Compared with a strong baseline obtained by WeNet, the proposed model achieves a 7.6% relative WER reduction on test-clean and 6.9% on test-other.
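The abstract does not spell out the objective, but EfficientTTS-style monotonicity losses are typically built from the expected input position of each attention row. A minimal, hedged sketch (illustrative only, not the paper's actual loss; function and variable names are assumptions): compute the expected source position for each output step and penalize backward jumps with a hinge.

```python
# Hedged sketch of a differentiable monotonicity penalty on attention weights,
# in the spirit of EfficientTTS-style expected-position constraints.
# This is NOT the paper's exact objective; it only illustrates the idea.
#
# For output step i with attention row alpha[i] (a distribution over input
# frames), the expected input position is
#     p_i = sum_j j * alpha[i][j]
# A monotonic alignment implies p_i <= p_{i+1}; violations are penalized
# with a hinge, which stays differentiable almost everywhere and is cheap
# to compute for any attention matrix, regardless of network structure.

def monotonic_alignment_penalty(alpha):
    """alpha: list of attention rows; each row sums to 1 over input frames."""
    positions = [sum(j * a for j, a in enumerate(row)) for row in alpha]
    # Hinge on backward jumps: max(0, p_i - p_{i+1})
    return sum(max(0.0, positions[i] - positions[i + 1])
               for i in range(len(positions) - 1))

# A perfectly monotonic (diagonal) alignment incurs no penalty:
diag = [[1.0, 0.0, 0.0],
        [0.0, 1.0, 0.0],
        [0.0, 0.0, 1.0]]
print(monotonic_alignment_penalty(diag))   # 0.0

# An alignment with a backward jump is penalized:
jumpy = [[0.0, 0.0, 1.0],
         [1.0, 0.0, 0.0],
         [0.0, 1.0, 0.0]]
print(monotonic_alignment_penalty(jumpy))  # 2.0
```

Because the penalty is a plain function of the attention matrix, it can be added to any attention-based encoder-decoder loss without changing the architecture, which matches the structure-agnostic property the abstract claims.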


doi: 10.21437/Interspeech.2022-11259

Cite as: Miao, C., Zou, K., Zhuang, Z., Wei, T., Ma, J., Wang, S., Xiao, J. (2022) Towards Efficiently Learning Monotonic Alignments for Attention-based End-to-End Speech Recognition. Proc. Interspeech 2022, 1051-1055, doi: 10.21437/Interspeech.2022-11259

@inproceedings{miao22c_interspeech,
  author={Chenfeng Miao and Kun Zou and Ziyang Zhuang and Tao Wei and Jun Ma and Shaojun Wang and Jing Xiao},
  title={{Towards Efficiently Learning Monotonic Alignments for Attention-based End-to-End Speech Recognition}},
  year=2022,
  booktitle={Proc. Interspeech 2022},
  pages={1051--1055},
  doi={10.21437/Interspeech.2022-11259}
}