Multimodal fusion of audio, vision, and text has demonstrated significant benefits in advancing the performance of several tasks, including machine translation, video captioning, and video summarization. Audio-Visual Scene-aware Dialog (AVSD) is a recently proposed and more challenging task that focuses on generating sentence responses to questions asked in a dialog about video content. While prior approaches to this task have shown the need for multimodal fusion to improve response quality, the best-performing systems often rely heavily on human-generated summaries of the video content, which are unavailable when such systems are deployed in the real world. This paper investigates how to compensate for such information, which is missing in the inference phase but available during the training phase. To this end, we propose a novel AVSD system using student-teacher learning, in which a student network is jointly trained to mimic the teacher's responses. Our experiments demonstrate that in addition to yielding state-of-the-art accuracy against the baseline DSTC7-AVSD system, the proposed approach (which does not use human-generated summaries at test time) performs competitively with methods that do use those summaries.
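The paper itself provides no code, but the core idea of student-teacher learning can be illustrated with a minimal, generic distillation objective: a teacher trained with the privileged summaries produces an output distribution, and the student (which lacks the summaries) is trained on a mix of the ground-truth loss and a KL term pulling it toward the teacher. All names and hyperparameters below (`distillation_loss`, `alpha`, `temperature`) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax; subtracting the max keeps exp() stable.
    z = np.asarray(logits, dtype=float) / temperature
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, target,
                      alpha=0.5, temperature=2.0):
    """Generic student-teacher objective (a sketch, not the paper's exact loss):
    cross-entropy on the ground-truth token plus a KL divergence that pulls
    the student's distribution toward the teacher's softened distribution."""
    p_student = softmax(student_logits, temperature)
    p_teacher = softmax(teacher_logits, temperature)
    # KL(teacher || student); small epsilon guards against log(0).
    kl = float(np.sum(p_teacher * (np.log(p_teacher + 1e-12)
                                   - np.log(p_student + 1e-12))))
    # Standard cross-entropy on the ground-truth target index.
    ce = -float(np.log(softmax(student_logits)[target] + 1e-12))
    # temperature**2 rescales KL gradients, as is conventional in distillation.
    return alpha * ce + (1.0 - alpha) * (temperature ** 2) * kl
```

At test time only the student is run, so the privileged summary input is never needed after training.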
Cite as: Hori, C., Cherian, A., Marks, T.K., Hori, T. (2019) Joint Student-Teacher Learning for Audio-Visual Scene-Aware Dialog. Proc. Interspeech 2019, 1886-1890, doi: 10.21437/Interspeech.2019-3143
@inproceedings{hori19_interspeech, author={Chiori Hori and Anoop Cherian and Tim K. Marks and Takaaki Hori}, title={{Joint Student-Teacher Learning for Audio-Visual Scene-Aware Dialog}}, year=2019, booktitle={Proc. Interspeech 2019}, pages={1886--1890}, doi={10.21437/Interspeech.2019-3143} }