ABSTRACT
Video Question Answering (VideoQA) is a challenging problem, as it requires a joint understanding of the video and the natural language question. Existing methods that perform correlation learning between the video and the question have achieved great success. However, previous methods merely model relations between individual video frames (or clips) and words, which is insufficient for correctly answering the question. From a human's perspective, answering a video question should first summarize both the visual and the language information, and then explore their correlations for answer reasoning. In this paper, we propose a new method called Pairwise VLAD Interaction Network (PVI-Net) to address this problem. Specifically, we develop a learnable clustering-based VLAD encoder to summarize the video and question modalities, respectively, into a small number of compact VLAD descriptors. For correlation learning, we propose a pairwise VLAD interaction mechanism that exploits the complementary information in each pair of modality descriptors, avoiding uninformative individual relations (e.g., frame-word and clip-word relations) and capturing both inter- and intra-modality relations simultaneously. Experimental results show that our approach achieves state-of-the-art performance on three VideoQA datasets: TGIF-QA, MSVD-QA, and MSRVTT-QA.
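For illustration only, the sketch below shows how such a pipeline could look in PyTorch: a NetVLAD-style learnable VLAD encoder applied to each modality, followed by self-attention over the joint set of descriptors as one plausible way to model every descriptor pair at once. All design choices here are assumptions (the linear soft-assignment layer, the multi-head attention fusion, and the module names, shapes, and hyperparameters are illustrative), not the authors' actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LearnableVLAD(nn.Module):
    """NetVLAD-style encoder: soft-assigns each local feature to K learnable
    cluster centers and aggregates residuals into K compact descriptors."""

    def __init__(self, feat_dim: int, num_clusters: int):
        super().__init__()
        self.centroids = nn.Parameter(0.1 * torch.randn(num_clusters, feat_dim))
        self.assign = nn.Linear(feat_dim, num_clusters)  # learnable soft assignment

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D) local features (frame/clip features or word embeddings)
        a = F.softmax(self.assign(x), dim=-1)               # (B, T, K) weights
        residuals = x.unsqueeze(2) - self.centroids         # (B, T, K, D)
        vlad = torch.einsum('btk,btkd->bkd', a, residuals)  # sum residuals over T
        return F.normalize(vlad, p=2, dim=-1)               # per-descriptor L2 norm


class PairwiseInteraction(nn.Module):
    """One plausible pairwise mechanism: self-attention over the union of
    video and question descriptors, so every pair (video-video,
    question-question, video-question) interacts in a single step."""

    def __init__(self, feat_dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, v_vlad: torch.Tensor, q_vlad: torch.Tensor) -> torch.Tensor:
        z = torch.cat([v_vlad, q_vlad], dim=1)  # (B, Kv + Kq, D) joint set
        out, _ = self.attn(z, z, z)             # all pairwise descriptor relations
        return self.norm(z + out)               # residual connection + layer norm


# Toy usage with assumed shapes: 32 frame features and 20 word embeddings,
# both already projected to a shared 512-d space.
video = torch.randn(2, 32, 512)
question = torch.randn(2, 20, 512)
encode = LearnableVLAD(feat_dim=512, num_clusters=8)
interact = PairwiseInteraction(feat_dim=512)
fused = interact(encode(video), encode(question))  # (2, 16, 512)
```

Note the design intuition: because the number of descriptors per modality is small (8 here versus dozens of frames and words), pairwise interaction over descriptors is far cheaper than exhaustive frame-word attention while still covering inter- and intra-modality relations.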