Multi-task Learning with Augmentation Strategy for Acoustic-to-word Attention-based Encoder-decoder Speech Recognition

Moriya, Takafumi; Ueno, Sei; Shinohara, Yusuke; Delcroix, Marc; Yamaguchi, Yoshikazu; Aono, Yushi

doi:10.21437/Interspeech.2018-1866

In this paper, we propose a novel training strategy for attention-based encoder-decoder acoustic-to-word end-to-end systems. The accuracy of end-to-end systems has greatly improved thanks to careful tuning of the model structure and the introduction of novel strategies that stabilize training. For example, multi-task learning with a shared encoder is often used to escape from bad local optima. However, multi-task learning usually relies on a linear interpolation of the losses of the sub-tasks, and consequently the shared encoder is not optimized for each individual task. To solve this problem, we propose multi-task learning with an augmentation strategy. We augment the training data by creating multiple copies of the original training data, each paired with the output targets of one sub-task. We then use each target loss sequentially to update the parameters of the shared encoder, which enhances its versatility in capturing acoustic features. This strategy enables better learning of the shared encoder because each task is trained with a dedicated loss. The parameters of the word decoder are jointly updated with the shared encoder when optimizing the word prediction task loss. We evaluate our proposal on various speech data sets and show that our models achieve lower word error rates than both single-task and conventional multi-task approaches.
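
The training scheme described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It replaces the attention-based encoder-decoder with a scalar "shared encoder" weight and one scalar linear head per sub-task, trained by plain SGD on a squared-error loss. The auxiliary sub-task name `char`, the task order, the per-epoch granularity of the sequential updates, and all numeric values are assumptions made only for illustration. What the sketch keeps from the abstract is the structure: one copy of the training data per sub-task, and a dedicated loss per task applied in sequence instead of a single interpolated loss.

```python
# Toy sketch (assumptions throughout): scalar shared encoder, one scalar head per task.

def augment(data, tasks):
    """Create one copy of the training data per sub-task, paired with that task's target."""
    return {task: [(x, targets[task]) for x, targets in data] for task in tasks}

def task_loss(params, task, x, y):
    """Squared-error loss of one task head applied on top of the shared encoder."""
    hidden = params["encoder"] * x
    return 0.5 * (params[task] * hidden - y) ** 2

def sgd_step(params, task, x, y, lr):
    """Update the shared encoder and ONE task head using only that task's loss."""
    hidden = params["encoder"] * x
    err = params[task] * hidden - y
    grad_head = err * hidden
    grad_encoder = err * params[task] * x
    params[task] -= lr * grad_head
    params["encoder"] -= lr * grad_encoder

def train(data, tasks, epochs=500, lr=0.02):
    """Apply each task's dedicated loss in sequence over its own copy of the data."""
    params = {"encoder": 0.5, **{task: 0.5 for task in tasks}}
    copies = augment(data, tasks)
    for _ in range(epochs):
        for task in tasks:  # sequential per-task updates, no loss interpolation
            for x, y in copies[task]:
                sgd_step(params, task, x, y, lr)
    return params

def mean_loss(params, data, task):
    """Average loss of one task over the original (non-augmented) data."""
    return sum(task_loss(params, task, x, t[task]) for x, t in data) / len(data)

# Hypothetical data: each input carries one target per sub-task.
DATA = [(1.0, {"word": 2.0, "char": 3.0}), (2.0, {"word": 4.0, "char": 6.0})]
TASKS = ["char", "word"]  # assumed order: auxiliary task first, word task last
```

Calling `train(DATA, TASKS)` returns the fitted parameters, and `mean_loss` can be used to compare each task's loss before and after training. A conventional multi-task baseline would instead compute one weighted sum of the task losses per sample and take a single gradient step on it.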

Cite as: Moriya, T., Ueno, S., Shinohara, Y., Delcroix, M., Yamaguchi, Y., Aono, Y. (2018) Multi-task Learning with Augmentation Strategy for Acoustic-to-word Attention-based Encoder-decoder Speech Recognition. Proc. Interspeech 2018, 2399-2403, doi: 10.21437/Interspeech.2018-1866

@inproceedings{moriya18_interspeech,
  author={Takafumi Moriya and Sei Ueno and Yusuke Shinohara and Marc Delcroix and Yoshikazu Yamaguchi and Yushi Aono},
  title={{Multi-task Learning with Augmentation Strategy for Acoustic-to-word Attention-based Encoder-decoder Speech Recognition}},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={2399--2403},
  doi={10.21437/Interspeech.2018-1866}
}