Abstract:
Long-term Video Question Answering plays an essential role in visual information retrieval, which aims at generating natural language answers to discretionary free-form q...Show MoreMetadata
Abstract:
Long-term Video Question Answering plays an essential role in visual information retrieval, which aims at generating natural language answers to discretionary free-form questions about the referenced long-term video. Rather than remember the video as a sequence of visual content, humans have an innate cognitive ability to identify the critical moments related to the question at first glance, then tie together the specific evidence around these critical moments for further analysis and reasoning. Motivated by this intuition, we propose the multimodal hierarchical memory attentive networks with two heterogeneous memory subnetworks: the top guided memory network and the bottom enhanced multimodal memory attentive network. The top guided memory network serves as a shallow inference engine to pick relevant and informative moments of questions and obtain salient video content at a coarse-grained level. Subsequently, the bottom enhanced multimodal memory attentive network is designed as an in-depth reasoning engine to perform more accurate attention with cues from video bottom evidence in a fine-grained level to enhance question answering quality. We evaluate the proposed method on three publicly available video question answering benchmarks, namely ActivityNet-QA, MSRVTT-QA, and MSVD-QA. Experimental results demonstrate that the proposed approach significantly outperforms other state-of-the-art methods for long-term videos. Extensive ablation studies are carried out to explore the reasons behind the proposed model's effectiveness.
Published in: IEEE Transactions on Circuits and Systems for Video Technology ( Volume: 31, Issue: 3, March 2021)