
Cross-Attentional Spatio-Temporal Semantic Graph Networks for Video Question Answering


Abstract:

Due to its rich spatio-temporal visual content and complex multimodal relations, Video Question Answering (VideoQA) is a challenging task that has attracted increasing attention. Current methods usually leverage visual attention, linguistic attention, or self-attention to uncover latent correlations between video content and question semantics. Although these methods exploit interactive information between different modalities to improve comprehension, they cannot effectively integrate inter- and intra-modality correlations in a unified model. To address this problem, we propose a novel VideoQA model called Cross-Attentional Spatio-Temporal Semantic Graph Networks (CASSG). Specifically, a multi-head multi-hop attention module with diversity and progressivity is first proposed to explore fine-grained interactions between different modalities in a crossing manner. Then, heterogeneous graphs are constructed from the cross-attended video frames, clips, and question words, in which multi-stream spatio-temporal semantic graphs are designed to synchronously reason over inter- and intra-modality correlations. Finally, a global and local information fusion method is proposed to coalesce the local reasoning vector learned from the multi-stream spatio-temporal semantic graphs with the global vector learned from another branch to infer the answer. Experimental results on three public VideoQA datasets confirm the effectiveness and superiority of our model compared with state-of-the-art methods.
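To make the first stage of the abstract more concrete, the sketch below illustrates one plausible reading of the "multi-head multi-hop attention in a crossing manner": video features repeatedly attend to question-word features over several hops, each hop refining the attended representation. This is only an illustrative approximation; the layer sizes, hop count, residual/normalization scheme, and class name are assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class CrossModalMultiHopAttention(nn.Module):
    """Hedged sketch of multi-head, multi-hop cross-attention between
    video features (frames/clips) and question-word features.
    Dimensions, hop count, and residual+norm scheme are illustrative
    assumptions, not the CASSG paper's exact design."""

    def __init__(self, dim=512, num_heads=8, num_hops=2):
        super().__init__()
        # one multi-head attention layer per hop, applied progressively
        self.hops = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True)
             for _ in range(num_hops)]
        )
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(num_hops)])

    def forward(self, video_feats, question_feats):
        # video_feats:    (batch, num_frames, dim)
        # question_feats: (batch, num_words,  dim)
        attended = video_feats
        for attn, norm in zip(self.hops, self.norms):
            # video features attend to question words ("crossing manner")
            out, _ = attn(query=attended, key=question_feats, value=question_feats)
            attended = norm(attended + out)  # residual connection + layer norm per hop
        return attended


# Toy usage with random features.
video = torch.randn(2, 16, 512)      # 2 videos, 16 frames each
question = torch.randn(2, 10, 512)   # 2 questions, 10 words each
cross_attended = CrossModalMultiHopAttention()(video, question)
print(cross_attended.shape)           # torch.Size([2, 16, 512])
```

In the full model as described, the cross-attended frame, clip, and word features would then serve as nodes of the heterogeneous spatio-temporal semantic graphs before the global-local fusion step.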
Published in: IEEE Transactions on Image Processing ( Volume: 31)
Page(s): 1684 - 1696
Date of Publication: 19 January 2022

PubMed ID: 35044914
