ABSTRACT
Video question answering (VideoQA) is an increasingly vital research field, spurred by the rapid proliferation of online video content and the need for intelligent systems that can comprehend and interact with it. Existing methods tend to focus on video understanding and cross-modal interaction modeling while overlooking the crucial aspect of comprehensive question understanding. To address this gap, we introduce the multi-modal and multi-layer question enhancement network, a framework that emphasizes nuanced question understanding. Our approach first extracts object, appearance, and motion features from videos. It then harnesses multi-layer outputs from a pre-trained language model to ensure a thorough grasp of the question. Guided by the global question and frame representations, object information is integrated into the appearance features, enabling the adaptive acquisition of appearance- and motion-enhanced question representations. By amalgamating these multi-modal question cues, our method accurately determines answers. Experimental results on three benchmarks demonstrate the superiority of our approach, underscoring the importance of advanced question comprehension in VideoQA.
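The pipeline sketched in the abstract (multi-layer question representations, visual-feature-guided question enhancement, and multi-modal fusion for answer scoring) can be illustrated at a very high level as follows. This is a minimal NumPy sketch under assumed shapes and random stand-in features, not the paper's actual model; all dimensions, the softmax layer-weighting, and the `enhance` attention are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8          # shared hidden size (illustrative)
L, T = 4, 6    # number of LM layers, question tokens (assumed)
Fr = 5         # number of video frames (assumed)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical inputs: layer-wise token embeddings from a pre-trained LM,
# plus per-frame appearance and motion features (random stand-ins here).
q_layers = rng.normal(size=(L, T, d))   # multi-layer question representations
appear   = rng.normal(size=(Fr, d))    # appearance features per frame
motion   = rng.normal(size=(Fr, d))    # motion features per clip

# Step 1: combine LM layers with (here: random, softmax-normalized) weights
# to obtain one token-level question representation and a global summary.
layer_w  = softmax(rng.normal(size=L))
q_tokens = np.tensordot(layer_w, q_layers, axes=1)  # (T, d)
q_global = q_tokens.mean(axis=0)                    # (d,)

def enhance(q_tok, vis):
    """Visual-feature-guided attention over question tokens:
    the mean visual feature acts as the attention query."""
    scores = q_tok @ vis.mean(axis=0)               # (T,)
    return softmax(scores) @ q_tok                  # (d,)

# Step 2: appearance- and motion-enhanced question representations.
q_app = enhance(q_tokens, appear)
q_mot = enhance(q_tokens, motion)

# Step 3: amalgamate the multi-modal question views and score candidates.
q_fused = np.concatenate([q_global, q_app, q_mot])  # (3d,)
answers = rng.normal(size=(4, 3 * d))               # 4 candidate answers
probs   = softmax(answers @ q_fused)                # answer distribution
```

In the actual model the layer weights and attention maps would be learned, and object features would additionally be fused into the appearance stream; the sketch only shows the overall data flow from multi-layer question encoding to multi-modal answer scoring.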