ISCA Archive Interspeech 2022

Cross-modal Transfer Learning via Multi-grained Alignment for End-to-End Spoken Language Understanding

Yi Zhu, Zexun Wang, Hang Liu, Peiying Wang, Mingchao Feng, Meng Chen, Xiaodong He

End-to-end spoken language understanding (E2E-SLU) has witnessed impressive improvements through cross-modal (text-to-audio) transfer learning. However, current methods mostly focus on coarse-grained, sequence-level text-to-audio knowledge transfer with simple losses, neglecting the fine-grained temporal alignment between the two modalities. In this work, we propose a novel multi-grained cross-modal transfer learning model for E2E-SLU. Specifically, we devise a cross-attention module that aligns text tokens with speech frame features, encouraging the model to focus on the salient acoustic features attended by each token while transferring semantic information. We also leverage contrastive learning to facilitate cross-modal representation learning at the sentence level. Finally, we explore various data augmentation methods to mitigate the scarcity of labelled training data for E2E-SLU. Extensive experiments on both English and Chinese SLU datasets verify the effectiveness of the proposed approach, and detailed analyses demonstrate its superiority and competitiveness.
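The abstract names two alignment mechanisms: token-to-frame cross attention and sentence-level contrastive learning. The sketch below illustrates the general form of both ideas, not the paper's actual implementation; all function names, tensor shapes, and the temperature value are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def token_frame_cross_attention(text_tokens, speech_frames):
    """Fine-grained alignment: each text token (query) attends over
    speech frame features (keys/values), yielding a token-aligned
    acoustic representation.
    text_tokens:   (T_text, D)   speech_frames: (T_speech, D)
    Returns: (T_text, D)"""
    d = text_tokens.shape[-1]
    scores = text_tokens @ speech_frames.T / np.sqrt(d)  # (T_text, T_speech)
    weights = softmax(scores, axis=-1)                   # rows sum to 1
    return weights @ speech_frames

def sentence_contrastive_loss(text_emb, speech_emb, temperature=0.1):
    """Coarse-grained alignment: an InfoNCE-style loss that pulls each
    paired (text, speech) sentence embedding together and pushes apart
    the other pairs in the batch.
    text_emb, speech_emb: (B, D)"""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    s = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)
    logits = t @ s.T / temperature                       # (B, B) similarities
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                  # diagonal = positives
```

A multi-grained objective in this style would combine the contrastive loss with a distance between the token-aligned acoustic features and the text token representations, alongside the downstream SLU loss.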


doi: 10.21437/Interspeech.2022-11378

Cite as: Zhu, Y., Wang, Z., Liu, H., Wang, P., Feng, M., Chen, M., He, X. (2022) Cross-modal Transfer Learning via Multi-grained Alignment for End-to-End Spoken Language Understanding. Proc. Interspeech 2022, 1131-1135, doi: 10.21437/Interspeech.2022-11378

@inproceedings{zhu22f_interspeech,
  author={Yi Zhu and Zexun Wang and Hang Liu and Peiying Wang and Mingchao Feng and Meng Chen and Xiaodong He},
  title={{Cross-modal Transfer Learning via Multi-grained Alignment for End-to-End Spoken Language Understanding}},
  year=2022,
  booktitle={Proc. Interspeech 2022},
  pages={1131--1135},
  doi={10.21437/Interspeech.2022-11378},
  issn={2308-457X}
}