DOI: 10.1145/3343031.3350969

Research article

Question-Aware Tube-Switch Network for Video Question Answering

Published: 15 October 2019

ABSTRACT

Video Question Answering (VideoQA), the task of answering questions about videos, involves rich spatio-temporal content (e.g., appearance and motion) and requires a multi-hop reasoning process. However, existing methods usually deal with appearance and motion separately and fail to synchronize the attention on appearance and motion features, neglecting two key properties of VideoQA: (1) appearance and motion features are usually concomitant and complementary to each other at the time-slice level, and some questions rely on joint representations of both kinds of features at some point in the video; (2) appearance and motion have different importance in multi-step reasoning. In this paper, we propose a novel Question-Aware Tube-Switch Network (TSN) for video question answering, which contains (1) a Mix module that synchronously combines the appearance and motion representations at the time-slice level, achieving fine-grained temporal alignment and correspondence between appearance and motion at every time slice, and (2) a Switch module that adaptively chooses the appearance or motion tube as primary at each reasoning step, guiding the multi-hop reasoning process. To train TSN end-to-end, we utilize the Gumbel-Softmax strategy to account for the discrete tube-switch process. Extensive experimental results on two benchmarks, MSVD-QA and MSRVTT-QA, demonstrate that the proposed TSN consistently outperforms the state-of-the-art on all metrics.
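The abstract describes the Switch module as making a discrete, question-aware choice between an appearance tube and a motion tube at each reasoning step, trained end-to-end with the Gumbel-Softmax strategy. As a rough illustration of that general mechanism only (not the authors' implementation), the PyTorch sketch below shows how such a hard tube choice can be sampled while still allowing gradients to flow; the TubeSwitch class, its linear scoring layer, and all dimensions are hypothetical assumptions.

```python
# Minimal sketch of a question-aware tube switch via straight-through Gumbel-Softmax.
# All names and shapes are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TubeSwitch(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Scores the two tubes (appearance, motion) conditioned on the question state.
        self.score = nn.Linear(3 * dim, 2)

    def forward(self, q_state, app_tube, mot_tube, tau=1.0):
        # q_state:  (batch, dim)  current question/reasoning state
        # app_tube: (batch, dim)  pooled appearance-tube representation
        # mot_tube: (batch, dim)  pooled motion-tube representation
        logits = self.score(torch.cat([q_state, app_tube, mot_tube], dim=-1))
        # Hard one-hot choice in the forward pass, soft gradients in the backward pass.
        gate = F.gumbel_softmax(logits, tau=tau, hard=True)       # (batch, 2)
        tubes = torch.stack([app_tube, mot_tube], dim=1)          # (batch, 2, dim)
        primary = (gate.unsqueeze(-1) * tubes).sum(dim=1)         # (batch, dim)
        return primary, gate

if __name__ == "__main__":
    switch = TubeSwitch(dim=256)
    q = torch.randn(4, 256)
    app, mot = torch.randn(4, 256), torch.randn(4, 256)
    primary, gate = switch(q, app, mot)
    print(primary.shape, gate.argmax(dim=-1))
```

With hard=True the forward pass commits to a single tube, mirroring the discrete switch decision described in the abstract, while the backward pass uses the soft relaxation (the usual straight-through Gumbel-Softmax estimator).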


Published in

MM '19: Proceedings of the 27th ACM International Conference on Multimedia
October 2019
2794 pages
ISBN: 9781450368896
DOI: 10.1145/3343031

Copyright © 2019 ACM


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 15 October 2019


Acceptance Rates

MM '19 Paper Acceptance Rate: 252 of 936 submissions, 27%
Overall Acceptance Rate: 995 of 4,171 submissions, 24%

