DOI: 10.1145/3581783.3612239
Research Article

Advancing Video Question Answering with a Multi-modal and Multi-layer Question Enhancement Network

Published: 27 October 2023

Abstract

Video question answering (VideoQA) is an increasingly vital research field, spurred by the rapid proliferation of video content online and the urgent need for intelligent systems that can comprehend and interact with this content. Existing methodologies often lean towards video understanding and cross-modal interaction modeling but tend to overlook the crucial aspect of comprehensive question understanding. To address this gap, we introduce the multi-modal and multi-layer question enhancement network, a framework emphasizing nuanced question understanding. Our approach begins by extracting object, appearance, and motion features from videos. Subsequently, we harness multi-layer outputs from a pre-trained language model, ensuring a thorough grasp of the question. The integration of object features into appearance features is guided by global question and frame representations, facilitating the adaptive acquisition of appearance- and motion-enhanced question representations. By amalgamating these multi-modal question insights, our method adeptly determines answers. Experimental results on three benchmarks demonstrate the superiority of our tailored approach, underscoring the importance of advanced question comprehension in VideoQA.
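The multi-layer question encoding described above suggests fusing representations from every layer of a pre-trained language model rather than relying on the final layer alone. The sketch below shows one common way to realize this idea with a BERT-style encoder and softmax-normalized learned layer weights; it is a minimal, illustrative reading under our own assumptions (the MultiLayerQuestionEncoder name, the weighting scheme, and mean pooling are ours), not the authors' exact module.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class MultiLayerQuestionEncoder(nn.Module):
    """Fuses the hidden states of every encoder layer into a single
    question representation via softmax-normalized learned weights."""

    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        # One weight per layer, including the embedding layer's output.
        n_layers = self.encoder.config.num_hidden_layers + 1
        self.layer_weights = nn.Parameter(torch.zeros(n_layers))

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask,
                           output_hidden_states=True)
        # hidden: (n_layers, batch, seq_len, hidden_dim)
        hidden = torch.stack(out.hidden_states, dim=0)
        w = torch.softmax(self.layer_weights, dim=0)
        fused = (w.view(-1, 1, 1, 1) * hidden).sum(dim=0)
        # Mask-aware mean pooling over tokens gives a global question vector.
        mask = attention_mask.unsqueeze(-1).float()
        return (fused * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-6)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = MultiLayerQuestionEncoder()
batch = tokenizer(["What is the man holding?"], return_tensors="pt")
q_global = encoder(batch["input_ids"], batch["attention_mask"])
print(q_global.shape)  # torch.Size([1, 768])
```

Learned layer weights let the model decide whether shallower (more syntactic) or deeper (more semantic) layers contribute most to the question representation.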

Supplemental Material

MP4 File
We introduce the research background and discuss the definition and framework of video question answering. Next, we review related work and identify the challenges in the current field. We then propose an innovative multi-modal and multi-layer question enhancement network that emphasizes nuanced question understanding, describing the feature extraction, interaction, and decoding processes in detail. Finally, we showcase the experimental results of this method on three benchmark datasets to demonstrate its superiority, emphasizing throughout the significance of advanced question understanding in VideoQA.
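The interaction process summarized above pairs question representations with appearance and motion features. As a rough illustration of what a question-guided cross-modal interaction step can look like, the sketch below lets a global question vector attend over per-frame visual features and fuses the attended context back into the question; the module name, dimensions, and concatenation-based fusion are illustrative assumptions rather than the paper's specification.

```python
import torch
import torch.nn as nn

class QuestionGuidedAttention(nn.Module):
    """Illustrative cross-modal interaction step: a global question vector
    attends over per-frame visual features (appearance or motion), and the
    attended context is fused back into the question representation."""

    def __init__(self, q_dim=768, v_dim=2048, hidden=512):
        super().__init__()
        self.q_proj = nn.Linear(q_dim, hidden)
        self.k_proj = nn.Linear(v_dim, hidden)
        self.v_proj = nn.Linear(v_dim, q_dim)
        self.scale = hidden ** 0.5
        self.fuse = nn.Linear(2 * q_dim, q_dim)

    def forward(self, question, frames):
        # question: (B, q_dim); frames: (B, T, v_dim)
        query = self.q_proj(question).unsqueeze(1)          # (B, 1, H)
        keys = self.k_proj(frames)                          # (B, T, H)
        scores = query @ keys.transpose(1, 2) / self.scale  # (B, 1, T)
        attn = torch.softmax(scores, dim=-1)
        context = (attn @ self.v_proj(frames)).squeeze(1)   # (B, q_dim)
        # Concatenate and project to obtain the "enhanced" question vector.
        return self.fuse(torch.cat([question, context], dim=-1))

# Toy usage with random tensors standing in for real question/video encodings.
enhancer = QuestionGuidedAttention()
q = torch.randn(2, 768)        # global question vectors
v = torch.randn(2, 16, 2048)   # 16 frames of appearance features
enhanced_q = enhancer(q, v)
print(enhanced_q.shape)        # torch.Size([2, 768])
```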

Cited By

• Collaborative Aware Bidirectional Semantic Reasoning for Video Question Answering. IEEE Transactions on Circuits and Systems for Video Technology, Vol. 35, 3 (2025), 2074--2086. https://doi.org/10.1109/TCSVT.2024.3490665
• Harnessing Representative Spatial-Temporal Information for Video Question Answering. ACM Transactions on Multimedia Computing, Communications, and Applications, Vol. 20, 10 (2024), 1--20. https://doi.org/10.1145/3675399

Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023, 9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. multi-layer question encoding
2. multi-modal question enhancement
3. video question answering

Funding Sources

• National Natural Science Foundation of China
• Special Fund for Distinguished Professors of Shandong Jianzhu University
• Shenzhen College Stability Support Plan
• Defense Advanced Research Projects Agency (DARPA)

Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa, ON, Canada

Acceptance Rates

Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%
