
Pairwise VLAD Interaction Network for Video Question Answering

Published: 17 October 2021 · DOI: 10.1145/3474085.3475620

ABSTRACT

Video Question Answering (VideoQA) is a challenging problem, as it requires a joint understanding of the video and the natural language question. Existing methods that perform correlation learning between video and question have achieved great success. However, previous methods merely model relations between individual video frames (or clips) and words, which is not sufficient to correctly answer the question. From a human perspective, answering a video question requires first summarizing both the visual and the language information, and then exploring their correlations for answer reasoning. In this paper, we propose a new method called Pairwise VLAD Interaction Network (PVI-Net) to address this problem. Specifically, we develop a learnable clustering-based VLAD encoder that summarizes the video and question modalities into a small number of compact VLAD descriptors. For correlation learning, a pairwise VLAD interaction mechanism is proposed to better exploit complementary information for each pair of modality descriptors, avoiding the modeling of uninformative individual relations (e.g., frame-word and clip-word relations) and exploring both inter- and intra-modality relations simultaneously. Experimental results show that our approach achieves state-of-the-art performance on three VideoQA datasets: TGIF-QA, MSVD-QA, and MSRVTT-QA.
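The abstract names two components: a learnable clustering-based VLAD encoder that condenses each modality into a few descriptors, and a pairwise interaction mechanism over those descriptors. The exact formulation is not given on this page, so the following PyTorch sketch is an assumption: it uses a standard NetVLAD-style soft-assignment layer as the learnable VLAD encoder and a generic multi-head self-attention block as a stand-in for the proposed pairwise VLAD interaction; the class name LearnableVLADEncoder, the cluster count, and the feature dimensions are all hypothetical.

```python
# Illustrative sketch only (not the authors' released code): a NetVLAD-style
# learnable VLAD encoder that soft-assigns frame/word features to K learnable
# centroids and aggregates the residuals into K compact descriptors.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableVLADEncoder(nn.Module):
    def __init__(self, feat_dim: int, num_clusters: int):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(num_clusters, feat_dim) * 0.01)
        self.assign = nn.Linear(feat_dim, num_clusters)  # soft-assignment logits

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, feat_dim) -- frame/clip or word features
        a = F.softmax(self.assign(x), dim=-1)                    # (B, N, K)
        residuals = x.unsqueeze(2) - self.centroids[None, None]  # (B, N, K, D)
        vlad = (a.unsqueeze(-1) * residuals).sum(dim=1)          # (B, K, D)
        return F.normalize(vlad, p=2, dim=-1)                    # K descriptors

# Each modality is summarized into K descriptors; interacting over the
# concatenated 2K descriptors (here with plain self-attention as a placeholder
# for the pairwise VLAD interaction) covers inter- and intra-modality relations
# without modeling individual frame-word or clip-word pairs.
video_feats = torch.randn(2, 32, 512)      # 32 frames, 512-d (illustrative)
question_feats = torch.randn(2, 14, 512)   # 14 words, 512-d (illustrative)
encoder = LearnableVLADEncoder(feat_dim=512, num_clusters=8)
descriptors = torch.cat([encoder(video_feats), encoder(question_feats)], dim=1)
attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
fused, _ = attn(descriptors, descriptors, descriptors)
print(fused.shape)  # torch.Size([2, 16, 512])
```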

References

  1. Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual Question Answering. In ICCV. 2425--2433.
  2. Jiayin Cai, Chun Yuan, Cheng Shi, Lei Li, Yangyang Cheng, and Ying Shan. 2020. Feature Augmented Memory with Global Attention Network for VideoQA. In IJCAI. 998--1004.
  3. François Chollet. 2017. Xception: Deep Learning with Depthwise Separable Convolutions. In CVPR. 1800--1807.
  4. Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M. F. Moura, Devi Parikh, and Dhruv Batra. 2017. Visual Dialog. In CVPR. 1080--1089.
  5. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL. 4171--4186.
  6. Matthijs Douze, Jérôme Revaud, Cordelia Schmid, and Hervé Jégou. 2013. Stable Hyper-pooling and Query Expansion for Event Detection. In ICCV. 1825--1832.
  7. Chenyou Fan, Xiaofan Zhang, Shu Zhang, Wensheng Wang, Chi Zhang, and Heng Huang. 2019. Heterogeneous Memory Enhanced Multimodal Attention Model for Video Question Answering. In CVPR. 1999--2007.
  8. Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. 2016. Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding. In EMNLP. 457--468.
  9. Jiyang Gao, Runzhou Ge, Kan Chen, and Ram Nevatia. 2018. Motion-Appearance Co-Memory Networks for Video Question Answering. In CVPR. 6576--6585.
  10. Lianli Gao, Pengpeng Zeng, Jingkuan Song, Yuan-Fang Li, Wu Liu, Tao Mei, and Heng Tao Shen. 2019. Structured Two-Stream Attention Network for Video Question Answering. In AAAI. 6391--6398.
  11. Rohit Girdhar, Deva Ramanan, Abhinav Gupta, Josef Sivic, and Bryan C. Russell. 2017. ActionVLAD: Learning Spatio-Temporal Aggregation for Action Classification. In CVPR. 3165--3174.
  12. Dan Guo, Hui Wang, and Meng Wang. 2019. Dual Visual Attention Network for Visual Dialog. In IJCAI. 4989--4995.
  13. Dan Guo, Hui Wang, Hanwang Zhang, Zheng-Jun Zha, and Meng Wang. 2020. Iterative Context-Aware Graph Inference for Visual Dialog. In CVPR. 10052--10061.
  14. Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. 2018. Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?. In CVPR. 6546--6555.
  15. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In CVPR. 770--778.
  16. Deng Huang, Peihao Chen, Runhao Zeng, Qing Du, Mingkui Tan, and Chuang Gan. 2020. Location-Aware Graph Convolutional Networks for Video Question Answering. In AAAI. 11021--11028.
  17. Mostafa S. Ibrahim, Srikanth Muralidharan, Zhiwei Deng, Arash Vahdat, and Greg Mori. 2016. A Hierarchical Deep Temporal Model for Group Activity Recognition. In CVPR. 1971--1980.
  18. Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gunhee Kim. 2017. TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering. In CVPR. 1359--1367.
  19. Hervé Jégou, Matthijs Douze, Cordelia Schmid, and Patrick Pérez. 2010. Aggregating local descriptors into a compact image representation. In CVPR. 3304--3311.
  20. Pin Jiang and Yahong Han. 2020. Reasoning with Heterogeneous Graph Alignment for Video Question Answering. In AAAI. 11109--11116.
  21. Weike Jin, Zhou Zhao, Mao Gu, Jun Yu, Jun Xiao, and Yueting Zhuang. 2019. Multi-interaction Network with Object Relation for Video Question Answering. In ACM MM. 1193--1201.
  22. Satwik Kottur, José M. F. Moura, Devi Parikh, Dhruv Batra, and Marcus Rohrbach. 2019. CLEVR-Dialog: A Diagnostic Dataset for Multi-Round Reasoning in Visual Dialog. In NAACL. 582--595.
  23. Thao Minh Le, Vuong Le, Svetha Venkatesh, and Truyen Tran. 2020. Hierarchical Conditional Relation Networks for Video Question Answering. In CVPR. 9968--9978.
  24. Jie Lei, Licheng Yu, Mohit Bansal, and Tamara L. Berg. 2018. TVQA: Localized, Compositional Video Question Answering. In EMNLP. 1369--1379.
  25. Xiangpeng Li, Lianli Gao, Xuanhan Wang, Wu Liu, Xing Xu, Heng Tao Shen, and Jingkuan Song. 2019. Learnable Aggregating Net with Diversity Learning for Video Question Answering. In ACM MM. 1166--1174.
  26. Xiangpeng Li, Jingkuan Song, Lianli Gao, Xianglong Liu, Wenbing Huang, Xiangnan He, and Chuang Gan. 2019. Beyond RNNs: Positional Self-Attention with Co-Attention for Video Question Answering. In AAAI. 8658--8665.
  27. Tianwei Lin, Xu Zhao, Haisheng Su, Chongjing Wang, and Ming Yang. 2018. BSN: Boundary Sensitive Network for Temporal Action Proposal Generation. In ECCV. 3--21.
  28. Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In EMNLP. 1532--1543.
  29. Florent Perronnin and Christopher R. Dance. 2007. Fisher Kernels on Visual Vocabularies for Image Categorization. In CVPR. 1--8.
  30. Mengye Ren, Ryan Kiros, and Richard S. Zemel. 2015. Exploring Models and Data for Image Question Answering. In NeurIPS. 2953--2961.
  31. Anna Rohrbach, Atousa Torabi, Marcus Rohrbach, Niket Tandon, Christopher Joseph Pal, Hugo Larochelle, Aaron C. Courville, and Bernt Schiele. 2017. Movie Description. International Journal of Computer Vision 123, 1 (2017), 94--120.
  32. Josef Sivic and Andrew Zisserman. 2003. Video Google: A Text Retrieval Approach to Object Matching in Videos. In ICCV. 1470--1477.
  33. Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2016. MovieQA: Understanding Stories in Movies through Question-Answering. In CVPR. 4631--4640.
  34. Du Tran, Lubomir D. Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning Spatiotemporal Features with 3D Convolutional Networks. In ICCV. 4489--4497.
  35. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In NeurIPS. 5998--6008.
  36. Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. 2016. Temporal Segment Networks: Towards Good Practices for Deep Action Recognition. In ECCV. 20--36.
  37. Xinyu Wang, Yuliang Liu, Chunhua Shen, Chun Chet Ng, Canjie Luo, Lianwen Jin, Chee Seng Chan, Anton van den Hengel, and Liangwei Wang. 2020. On the General Value of Evidence, and Bilingual Scene-Text Visual Question Answering. In CVPR. 10123--10132.
  38. Dejing Xu, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. 2017. Video Question Answering via Gradually Refined Attention over Appearance and Motion. In ACM MM. 1645--1653.
  39. Jun Xu, Tao Mei, Ting Yao, and Yong Rui. 2016. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. In CVPR. 5288--5296.
  40. Youjiang Xu, Yahong Han, Richang Hong, and Qi Tian. 2018. Sequential Video VLAD: Training the Aggregation Locally and Temporally. IEEE Transactions on Image Processing 27, 10 (2018), 4933--4944.
  41. Yunan Ye, Zhou Zhao, Yimeng Li, Long Chen, Jun Xiao, and Yueting Zhuang. 2017. Video Question Answering via Attribute-Augmented Attention Network Learning. In SIGIR. 829--832.
  42. Youngjae Yu, Hyungjin Ko, Jongwook Choi, and Gunhee Kim. 2017. End-to-End Concept Word Detection for Video Captioning, Retrieval, and Question Answering. In CVPR. 3261--3269.
  43. Zheng-Jun Zha, Jiawei Liu, Tianhao Yang, and Yongdong Zhang. 2019. Spatiotemporal-Textual Co-Attention Network for Video Question Answering. ACM Transactions on Multimedia Computing, Communications, and Applications 15, 2s (2019), 1--18.
  44. J. Zhang and Y. Peng. 2020. Video Captioning With Object-Aware Spatio-Temporal Correlation and Aggregation. IEEE Transactions on Image Processing 29 (2020), 6209--6222.
  45. J. Zhang, J. Shao, R. Cao, L. Gao, X. Xu, and H. T. Shen. 2020. Action-Centric Relation Transformer Network for Video Question Answering. IEEE Transactions on Circuits and Systems for Video Technology (2020).
  46. Wenqiao Zhang, Siliang Tang, Yanpeng Cao, Shiliang Pu, Fei Wu, and Yueting Zhuang. 2020. Frame Augmented Alternating Attention Network for Video Question Answering. IEEE Transactions on Multimedia 22, 4 (2020), 1032--1041.
  47. Zhou Zhao, Qifan Yang, Deng Cai, Xiaofei He, and Yueting Zhuang. 2017. Video Question Answering via Hierarchical Spatio-Temporal Attention Networks. In IJCAI. 3518--3524.

Published in

MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021, 5796 pages
ISBN: 9781450386517
DOI: 10.1145/3474085
Copyright © 2021 ACM
Publisher: Association for Computing Machinery, New York, NY, United States
