Abstract:
Video question answering, which aims to answer a natural-language question about a given video, has gained popularity in recent years. Although significant improvements have been achieved, two challenges remain: sufficiently comprehending video content and handling long-tailed answer distributions. To this end, we propose a multi-granularity interaction and integration network for video question answering. It jointly explores multi-level intra-granularity and inter-granularity relations to enhance video comprehension. Specifically, we first build a word-enhanced visual representation module to achieve cross-modal alignment. We then introduce a multi-granularity interaction module to explore the intra-granularity and inter-granularity relationships. Finally, a question-guided interaction module is developed to select question-related visual representations for answer prediction. In addition, we employ the seesaw loss for open-ended tasks to alleviate the effect of the long-tailed word distribution. Both quantitative and qualitative results on the TGIF-QA, MSRVTT-QA, and MSVD-QA datasets demonstrate the superiority of our model over several state-of-the-art approaches.
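The seesaw loss mentioned above for open-ended answers re-weights the negative logits of a softmax cross-entropy so that frequent (head) answer words do not overwhelm the gradients of rare (tail) ones. The sketch below is a minimal single-sample NumPy illustration following the general seesaw formulation (mitigation factor from class counts, compensation factor from predicted probabilities); the function name, hyperparameters `p` and `q`, and the per-sample interface are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def seesaw_loss(logits, target, class_counts, p=0.8, q=2.0):
    """Illustrative single-sample seesaw cross-entropy (hypothetical sketch).

    logits: (C,) raw scores over the answer vocabulary
    target: index of the ground-truth answer class
    class_counts: (C,) cumulative sample counts per class
    p, q: mitigation / compensation exponents (assumed values)
    """
    C = logits.shape[0]
    # Numerically stable softmax for the compensation factor.
    z = logits - logits.max()
    probs = np.exp(z) / np.exp(z).sum()
    i = target
    # Seesaw weight S[j] = M[j] * Cf[j] for each negative class j (S[i] = 1).
    S = np.ones(C)
    for j in range(C):
        if j == i:
            continue
        # Mitigation: shrink the penalty a head-class sample puts on a
        # rarer negative class (count ratio < 1 is down-weighted).
        M = min(1.0, (class_counts[j] / max(class_counts[i], 1.0)) ** p)
        # Compensation: restore the penalty when a negative class is
        # currently more confident than the ground-truth class.
        Cf = (probs[j] / probs[i]) ** q if probs[j] > probs[i] else 1.0
        S[j] = M * Cf
    # Equivalent to the re-balanced softmax: scale negative exponentials by S.
    zz = logits + np.log(S + 1e-12)
    zz -= zz.max()
    log_softmax = zz - np.log(np.exp(zz).sum())
    return -log_softmax[i]
```

With balanced class counts and a correctly ranked target, all seesaw weights are 1 and the loss reduces to standard cross-entropy; when the target belongs to a head class, tail-class negatives are down-weighted and the loss is strictly smaller.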
Published in: IEEE Transactions on Circuits and Systems for Video Technology (Volume 33, Issue 12, December 2023)