
Spatiotemporal-Textual Co-Attention Network for Video Question Answering

Published: 19 July 2019

Abstract

Visual Question Answering (VQA) aims to provide a natural language answer to a natural language question about an image or video. Despite recent progress on VQA, existing works primarily focus on image question answering and are suboptimal for video question answering. This article presents a novel Spatiotemporal-Textual Co-Attention Network (STCA-Net) for video question answering. STCA-Net jointly learns spatial and temporal visual attention on videos as well as textual attention on questions, concentrating on the essential cues in both the visual and textual spaces and thereby producing an effective question-video representation. In particular, a question-guided attention network learns a question-aware video representation through a spatial-temporal attention module, focusing the network on regions of interest within the frames of interest across the entire video. A video-guided attention network learns a video-aware question representation through a textual attention module, leading to a fine-grained understanding of the question. The learned video and question representations are then fed to an answer predictor to generate answers. Extensive experiments on two challenging video question answering datasets, MSVD-QA and MSRVTT-QA, demonstrate the effectiveness of the proposed approach.
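To make the described attention flow concrete, below is a minimal PyTorch sketch of question-guided spatial-temporal attention over video features combined with video-guided textual attention over question words. The module names, feature dimensions, and the additive (tanh) scoring function are illustrative assumptions, not the authors' exact formulation; the precise architecture is given in the full article.

```python
# A minimal sketch of the co-attention idea, assuming additive attention and
# generic feature dimensions. Not the authors' exact STCA-Net implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdditiveAttention(nn.Module):
    """Scores each candidate feature against a guide vector and returns the
    attention-weighted sum of the candidates."""

    def __init__(self, feat_dim, guide_dim, hidden_dim=256):
        super().__init__()
        self.proj_feat = nn.Linear(feat_dim, hidden_dim)
        self.proj_guide = nn.Linear(guide_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, feats, guide):
        # feats: (batch, n, feat_dim); guide: (batch, guide_dim)
        h = torch.tanh(self.proj_feat(feats) + self.proj_guide(guide).unsqueeze(1))
        alpha = F.softmax(self.score(h).squeeze(-1), dim=1)      # (batch, n)
        return torch.bmm(alpha.unsqueeze(1), feats).squeeze(1)   # (batch, feat_dim)


class CoAttentionVQA(nn.Module):
    def __init__(self, vid_dim=2048, q_dim=512, n_answers=1000):
        super().__init__()
        # Question-guided visual attention: over regions within each frame
        # (spatial), then over frames (temporal).
        self.spatial_att = AdditiveAttention(vid_dim, q_dim)
        self.temporal_att = AdditiveAttention(vid_dim, q_dim)
        # Video-guided textual attention over question words.
        self.textual_att = AdditiveAttention(q_dim, vid_dim)
        self.classifier = nn.Linear(vid_dim + q_dim, n_answers)

    def forward(self, video, question_words, question_global):
        # video: (batch, T frames, R regions, vid_dim)
        # question_words: (batch, L, q_dim); question_global: (batch, q_dim)
        b, t, r, d = video.shape
        # Spatial attention per frame, guided by the global question vector.
        regions = video.view(b * t, r, d)
        guide = question_global.unsqueeze(1).expand(b, t, -1).reshape(b * t, -1)
        frame_feats = self.spatial_att(regions, guide).view(b, t, d)
        # Temporal attention over the attended frame features.
        video_feat = self.temporal_att(frame_feats, question_global)
        # Textual attention over question words, guided by the video feature.
        question_feat = self.textual_att(question_words, video_feat)
        # Fuse the two representations and predict an answer.
        return self.classifier(torch.cat([video_feat, question_feat], dim=-1))


# Example: 8 clips of 20 frames x 36 regions with 2048-d features,
# questions of 12 words with 512-d embeddings.
model = CoAttentionVQA()
logits = model(torch.randn(8, 20, 36, 2048),
               torch.randn(8, 12, 512),
               torch.randn(8, 512))
```

In this sketch the spatial module attends over regions within each frame before the temporal module pools across frames, mirroring the abstract's "regions of interest within the frames of interest"; the textual module then re-reads the question under the guidance of the attended video feature before fusion and answer prediction.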



Information & Contributors

Information

Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 15, Issue 2s
Special Section on Cross-Media Analysis for Visual Question Answering, Special Section on Big Data, Machine Learning and AI Technologies for Art and Design, and Special Section on MMSys/NOSSDAV 2018
April 2019
381 pages
ISSN:1551-6857
EISSN:1551-6865
DOI:10.1145/3343360
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 July 2019
Accepted: 01 March 2019
Revised: 01 March 2019
Received: 01 August 2018
Published in TOMM Volume 15, Issue 2s


Author Tags

  1. Video question answering
  2. Attention mechanism

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • National Natural Science Foundation of China
  • Fundamental Research Funds for the Central Universities


Article Metrics

  • Downloads (Last 12 months): 20
  • Downloads (Last 6 weeks): 1
Reflects downloads up to 08 Mar 2025

Cited By
  • (2025) Video Question Answering. Journal of Visual Communication and Image Representation 105:C. DOI: 10.1016/j.jvcir.2024.104320. Online publication date: 11-Feb-2025.
  • (2024) Harnessing Representative Spatial-Temporal Information for Video Question Answering. ACM Transactions on Multimedia Computing, Communications, and Applications 20:10 (1-20). DOI: 10.1145/3675399. Online publication date: 5-Jul-2024.
  • (2024) HMTV: hierarchical multimodal transformer for video highlight query on baseball. Multimedia Systems 30:5. DOI: 10.1007/s00530-024-01479-6. Online publication date: 23-Sep-2024.
  • (2024) Appearance-Motion Dual-Stream Heterogeneous Network for VideoQA. MultiMedia Modeling (212-227). DOI: 10.1007/978-3-031-53311-2_16. Online publication date: 29-Jan-2024.
  • (2023) Video Question Answering with Overcoming Spatial and Temporal Redundancy in Feature Extraction. Journal of Broadcast Engineering 28:7 (849-858). DOI: 10.5909/JBE.2023.28.7.849. Online publication date: 31-Dec-2023.
  • (2023) Adaptive enhancement design of non-significant regions of a Wushu action 3D image based on the symmetric difference algorithm. Mathematical Biosciences and Engineering 20:8 (14793-14810). DOI: 10.3934/mbe.2023662. Online publication date: 2023.
  • (2023) Hierarchical Synergy-Enhanced Multimodal Relational Network for Video Question Answering. ACM Transactions on Multimedia Computing, Communications, and Applications 20:4 (1-22). DOI: 10.1145/3630101. Online publication date: 11-Dec-2023.
  • (2023) Cross-modality Multiple Relations Learning for Knowledge-based Visual Question Answering. ACM Transactions on Multimedia Computing, Communications, and Applications 20:3 (1-22). DOI: 10.1145/3618301. Online publication date: 23-Oct-2023.
  • (2023) Fine-Grained Text-to-Video Temporal Grounding from Coarse Boundary. ACM Transactions on Multimedia Computing, Communications, and Applications 19:5 (1-21). DOI: 10.1145/3579825. Online publication date: 16-Mar-2023.
  • (2023) Semantic Embedding Guided Attention with Explicit Visual Feature Fusion for Video Captioning. ACM Transactions on Multimedia Computing, Communications, and Applications 19:2 (1-18). DOI: 10.1145/3550276. Online publication date: 6-Feb-2023.
