DOI: 10.1145/3581783.3612239

Advancing Video Question Answering with a Multi-modal and Multi-layer Question Enhancement Network

Published: 27 October 2023

ABSTRACT

Video question answering (VideoQA) is an increasingly important research field, spurred by the rapid proliferation of online video content and the need for intelligent systems that can comprehend and interact with it. Existing methods tend to focus on video understanding and cross-modal interaction modeling while overlooking the crucial aspect of comprehensive question understanding. To address this gap, we introduce the multi-modal and multi-layer question enhancement network, a framework that emphasizes nuanced question understanding. Our approach first extracts object, appearance, and motion features from videos. We then exploit the multi-layer outputs of a pre-trained language model to obtain a thorough representation of the question. Object features are integrated into appearance features under the guidance of the global question and frame representations, enabling the adaptive acquisition of appearance- and motion-enhanced question representations. By fusing these multi-modal question representations, our method effectively infers answers. Experimental results on three benchmarks demonstrate the superiority of our approach and underscore the importance of advanced question comprehension in VideoQA.
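The abstract names two core mechanisms: fusing the multi-layer outputs of a pre-trained language model into a question representation, and enhancing that representation with appearance and motion features from the video. The page includes no code, so the following is a minimal, illustrative PyTorch sketch of those two ideas only; every module name, dimension, and fusion choice (learned softmax layer weights, cross-attention with a residual connection) is an assumption for exposition, not the authors' implementation, and the object-feature integration step is omitted for brevity.

```python
# Minimal sketch (not the authors' code) of two ideas from the abstract:
# (1) fuse hidden states from every layer of a pre-trained language model
#     into a single question representation, and
# (2) enhance that representation with video appearance/motion features
#     via cross-attention. All names and sizes below are illustrative.
import torch
import torch.nn as nn


class MultiLayerQuestionEncoder(nn.Module):
    """Fuse per-layer hidden states of a pre-trained LM (assumed design)."""

    def __init__(self, num_layers: int, dim: int):
        super().__init__()
        # One learnable scalar weight per LM layer, softmax-normalized.
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))
        self.proj = nn.Linear(dim, dim)

    def forward(self, hidden_states):  # list of (B, T, D), one per layer
        stacked = torch.stack(hidden_states, dim=0)          # (L, B, T, D)
        w = torch.softmax(self.layer_logits, dim=0)          # (L,)
        fused = (w.view(-1, 1, 1, 1) * stacked).sum(dim=0)   # (B, T, D)
        return self.proj(fused)


class VisualEnhancedQuestion(nn.Module):
    """Enhance question tokens with visual features via cross-attention."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, question, visual):  # (B, T, D), (B, N, D)
        enhanced, _ = self.attn(query=question, key=visual, value=visual)
        return self.norm(question + enhanced)  # residual + layer norm


if __name__ == "__main__":
    B, T, N, D, L = 2, 12, 16, 256, 13        # toy sizes (L = LM layers)
    layers = [torch.randn(B, T, D) for _ in range(L)]
    q_enc = MultiLayerQuestionEncoder(L, D)
    question = q_enc(layers)                  # multi-layer question fusion
    appearance = torch.randn(B, N, D)         # per-frame appearance features
    motion = torch.randn(B, N, D)             # clip-level motion features
    enhance = VisualEnhancedQuestion(D)
    q_app = enhance(question, appearance)     # appearance-enhanced question
    q_mot = enhance(q_app, motion)            # motion-enhanced question
    answer_logits = nn.Linear(D, 1000)(q_mot.mean(dim=1))  # answer scores
    print(answer_logits.shape)                # torch.Size([2, 1000])
```

The learned per-layer softmax weighting shown here is one common way to pool multi-layer LM outputs (ELMo-style scalar mixing); the paper may use a different fusion, and its guided object-into-appearance integration is not reproduced in this sketch.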


Supplemental Material

mmfp2361-video.mp4 (MP4, 126.1 MB)



Published in

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783

      Copyright © 2023 ACM


      Publisher

      Association for Computing Machinery

      New York, NY, United States



      Qualifiers

      • research-article

      Acceptance Rates

Overall Acceptance Rate: 995 of 4,171 submissions, 24%

