DOI: 10.1145/3474085.3475620

Pairwise VLAD Interaction Network for Video Question Answering

Published: 17 October 2021

Abstract

Video Question Answering (VideoQA) is a challenging problem, as it requires a joint understanding of the video and the natural-language question. Existing methods that perform correlation learning between video and question have achieved great success. However, previous methods merely model relations between individual video frames (or clips) and words, which is insufficient for correctly answering the question. From a human perspective, answering a video question requires first summarizing both the visual and the language information, and then exploring their correlations for answer reasoning. In this paper, we propose a new method, the Pairwise VLAD Interaction Network (PVI-Net), to address this problem. Specifically, we develop a learnable clustering-based VLAD encoder that summarizes the video and question modalities into a small number of compact VLAD descriptors each. For correlation learning, a pairwise VLAD interaction mechanism is proposed to better exploit complementary information for each pair of modality descriptors, avoiding the modeling of uninformative individual relations (e.g., frame-word and clip-word relations) and exploring inter- and intra-modality relations simultaneously. Experimental results show that our approach achieves state-of-the-art performance on three VideoQA datasets: TGIF-QA, MSVD-QA, and MSRVTT-QA.
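
To make the abstract's pipeline concrete, below is a minimal PyTorch sketch of the two components it describes, under two assumptions not confirmed by the text shown here: that the learnable clustering-based VLAD encoder follows the NetVLAD recipe (soft cluster assignment plus residual aggregation), and that the pairwise descriptor interaction can be approximated by self-attention over the concatenated video and question descriptors, which lets every inter- and intra-modality descriptor pair interact in one step. All module names, layer counts, and dimensions are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VLADEncoder(nn.Module):
    """NetVLAD-style learnable clustering (assumed): summarizes a variable
    number of input features (frames/clips or words) into K descriptors."""

    def __init__(self, dim: int, num_clusters: int):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_clusters, dim))
        self.assign = nn.Linear(dim, num_clusters)  # soft-assignment logits

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, D) -> descriptors: (B, K, D)
        a = F.softmax(self.assign(x), dim=-1)     # (B, N, K) soft assignments
        resid = x.unsqueeze(2) - self.centers     # (B, N, K, D) residuals
        v = (a.unsqueeze(-1) * resid).sum(dim=1)  # weighted residual sum
        return F.normalize(v, dim=-1)             # intra-normalized descriptors


class PairwiseVLADInteraction(nn.Module):
    """Summarize each modality into K VLAD descriptors, then let every
    descriptor pair interact via self-attention over the concatenation,
    covering inter- and intra-modality relations simultaneously."""

    def __init__(self, dim: int, num_clusters: int, heads: int = 8):
        super().__init__()
        self.video_vlad = VLADEncoder(dim, num_clusters)
        self.question_vlad = VLADEncoder(dim, num_clusters)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.interact = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, video_feats: torch.Tensor,
                question_feats: torch.Tensor) -> torch.Tensor:
        dv = self.video_vlad(video_feats)        # (B, K, D)
        dq = self.question_vlad(question_feats)  # (B, K, D)
        fused = self.interact(torch.cat([dv, dq], dim=1))  # (B, 2K, D)
        return fused.mean(dim=1)                 # (B, D) pooled for answering


# Usage: fuse 32 frame features and 12 word features into one vector.
video = torch.randn(2, 32, 512)
question = torch.randn(2, 12, 512)
model = PairwiseVLADInteraction(dim=512, num_clusters=8)
print(model(video, question).shape)  # torch.Size([2, 512])
```

Note that attention here runs over only 2K descriptors (16 tokens above) regardless of how many frames or words the inputs contain, which mirrors the abstract's motivation for summarizing each modality first and interacting second instead of modeling all frame-word pairs.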



Information

Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021
5796 pages
ISBN:9781450386517
DOI:10.1145/3474085
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. pairwise interaction
  2. video question answering
  3. vlad

Qualifiers

  • Research-article

Conference

MM '21: ACM Multimedia Conference
October 20-24, 2021
Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Article Metrics

  • Downloads (last 12 months): 27
  • Downloads (last 6 weeks): 6
Reflects downloads up to 05 Mar 2025

Cited By

  • So Many Heads, So Many Wits: Multimodal Graph Reasoning for Text-Based Visual Question Answering. IEEE Transactions on Systems, Man, and Cybernetics: Systems 54(2), 854-865 (Feb 2024). DOI: 10.1109/TSMC.2023.3319964
  • Hybrid Graph Reasoning With Dynamic Interaction for Visual Dialog. IEEE Transactions on Multimedia 26, 9095-9108 (2024). DOI: 10.1109/TMM.2024.3385997
  • Multi-Granularity Relational Attention Network for Audio-Visual Question Answering. IEEE Transactions on Circuits and Systems for Video Technology 34(8), 7080-7094 (Aug 2024). DOI: 10.1109/TCSVT.2023.3264524
  • Hierarchical synchronization with structured multi-granularity interaction for video question answering. Neurocomputing 582 (14 May 2024). DOI: 10.1016/j.neucom.2024.127494
  • Video Q&A based on two-stage deep exploration of temporally-evolving features with enhanced cross-modal attention mechanism. Neural Computing and Applications 36(14), 8055-8071 (27 Feb 2024). DOI: 10.1007/s00521-024-09482-8
  • Transformer-Based Visual Grounding with Cross-Modality Interaction. ACM Transactions on Multimedia Computing, Communications, and Applications 19(6), 1-19 (9 Mar 2023). DOI: 10.1145/3587251
  • Language-Guided Visual Aggregation Network for Video Question Answering. Proceedings of the 31st ACM International Conference on Multimedia, 5195-5203 (26 Oct 2023). DOI: 10.1145/3581783.3613909
  • ERM: Energy-Based Refined-Attention Mechanism for Video Question Answering. IEEE Transactions on Circuits and Systems for Video Technology 33(3), 1454-1467 (1 Mar 2023). DOI: 10.1109/TCSVT.2022.3212463
  • Two-Stream Heterogeneous Graph Network with Dynamic Interactive Learning for Video Question Answering. 2023 International Joint Conference on Neural Networks (IJCNN), 1-8 (18 Jun 2023). DOI: 10.1109/IJCNN54540.2023.10191238
  • Confidence-Based Event-Centric Online Video Question Answering on a Newly Constructed ATBS Dataset. ICASSP 2023 - IEEE International Conference on Acoustics, Speech and Signal Processing, 1-5 (4 Jun 2023). DOI: 10.1109/ICASSP49357.2023.10095044
