DOI: 10.1145/3581783.3612239
Research Article

Advancing Video Question Answering with a Multi-modal and Multi-layer Question Enhancement Network

Published: 27 October 2023

Abstract

Video question answering (VideoQA) is an increasingly vital research field, spurred by the rapid proliferation of video content online and the urgent need for intelligent systems that can comprehend and interact with this content. Existing methodologies often lean towards video understanding and cross-modal interaction modeling but tend to overlook the crucial aspect of comprehensive question understanding. To address this gap, we introduce the multi-modal and multi-layer question enhancement network, a framework emphasizing nuanced question understanding. Our approach begins by extracting object, appearance, and motion features from videos. Subsequently, we harness multi-layer outputs from a pre-trained language model, ensuring a thorough grasp of the question. The integration of object features into appearance features is guided by global question and frame representations, facilitating the adaptive acquisition of appearance- and motion-enhanced question representations. By amalgamating these multi-modal question insights, our method adeptly determines answers. Experimental results on three benchmarks demonstrate the superiority of our tailored approach, underscoring the importance of advanced question comprehension in VideoQA.
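The multi-layer question encoding described above suggests fusing representations from every layer of a pre-trained language model rather than relying on the final layer alone. The sketch below shows one common way to realize this idea with a BERT-style encoder and softmax-normalized learned layer weights; it is a minimal, illustrative reading under our own assumptions (the MultiLayerQuestionEncoder name, the weighting scheme, and mean pooling are ours), not the authors' exact module.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class MultiLayerQuestionEncoder(nn.Module):
    """Fuses the hidden states of every encoder layer into a single
    question representation via softmax-normalized learned weights."""

    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        # One weight per layer, including the embedding layer's output.
        n_layers = self.encoder.config.num_hidden_layers + 1
        self.layer_weights = nn.Parameter(torch.zeros(n_layers))

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask,
                           output_hidden_states=True)
        # hidden: (n_layers, batch, seq_len, hidden_dim)
        hidden = torch.stack(out.hidden_states, dim=0)
        w = torch.softmax(self.layer_weights, dim=0)
        fused = (w.view(-1, 1, 1, 1) * hidden).sum(dim=0)
        # Mask-aware mean pooling over tokens gives a global question vector.
        mask = attention_mask.unsqueeze(-1).float()
        return (fused * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-6)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = MultiLayerQuestionEncoder()
batch = tokenizer(["What is the man holding?"], return_tensors="pt")
q_global = encoder(batch["input_ids"], batch["attention_mask"])
print(q_global.shape)  # torch.Size([1, 768])
```

Learned layer weights let the model decide whether shallower (more syntactic) or deeper (more semantic) layers contribute most to the question representation.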

Supplemental Material

MP4 File
We introduce the research background and discuss the definition and framework of video question answering. Next, we review related work and identify the challenges in the current field. We then propose an innovative multi-modal and multi-layer question enhancement network that emphasizes nuanced question understanding, describing the feature extraction, interaction, and decoding processes in detail. Finally, we showcase the experimental results of this method on three benchmark datasets to demonstrate its superiority, emphasizing throughout the significance of advanced question understanding in VideoQA.
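The interaction process summarized above pairs question representations with appearance and motion features. As a rough illustration of what a question-guided cross-modal interaction step can look like, the sketch below lets a global question vector attend over per-frame visual features and fuses the attended context back into the question; the module name, dimensions, and concatenation-based fusion are illustrative assumptions rather than the paper's specification.

```python
import torch
import torch.nn as nn

class QuestionGuidedAttention(nn.Module):
    """Illustrative cross-modal interaction step: a global question vector
    attends over per-frame visual features (appearance or motion), and the
    attended context is fused back into the question representation."""

    def __init__(self, q_dim=768, v_dim=2048, hidden=512):
        super().__init__()
        self.q_proj = nn.Linear(q_dim, hidden)
        self.k_proj = nn.Linear(v_dim, hidden)
        self.v_proj = nn.Linear(v_dim, q_dim)
        self.scale = hidden ** 0.5
        self.fuse = nn.Linear(2 * q_dim, q_dim)

    def forward(self, question, frames):
        # question: (B, q_dim); frames: (B, T, v_dim)
        query = self.q_proj(question).unsqueeze(1)          # (B, 1, H)
        keys = self.k_proj(frames)                          # (B, T, H)
        scores = query @ keys.transpose(1, 2) / self.scale  # (B, 1, T)
        attn = torch.softmax(scores, dim=-1)
        context = (attn @ self.v_proj(frames)).squeeze(1)   # (B, q_dim)
        # Concatenate and project to obtain the "enhanced" question vector.
        return self.fuse(torch.cat([question, context], dim=-1))

# Toy usage with random tensors standing in for real question/video encodings.
enhancer = QuestionGuidedAttention()
q = torch.randn(2, 768)        # global question vectors
v = torch.randn(2, 16, 2048)   # 16 frames of appearance features
enhanced_q = enhancer(q, v)
print(enhanced_q.shape)        # torch.Size([2, 768])
```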

Cited By

• Collaborative Aware Bidirectional Semantic Reasoning for Video Question Answering. IEEE Transactions on Circuits and Systems for Video Technology, Vol. 35, 3 (2025), 2074--2086. https://doi.org/10.1109/TCSVT.2024.3490665
• Harnessing Representative Spatial-Temporal Information for Video Question Answering. ACM Transactions on Multimedia Computing, Communications, and Applications, Vol. 20, 10 (2024), 1--20. https://doi.org/10.1145/3675399

Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023, 9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. multi-layer question encoding
2. multi-modal question enhancement
3. video question answering

Funding Sources

• National Natural Science Foundation of China
• Special Fund for Distinguished Professors of Shandong Jianzhu University
• Shenzhen College Stability Support Plan
• Defense Advanced Research Projects Agency (DARPA)

Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa, ON, Canada

Acceptance Rates

Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%
