DOI: 10.1145/3343031.3350969

Question-Aware Tube-Switch Network for Video Question Answering

Published: 15 October 2019

Abstract

Video Question Answering (VideoQA), the task of answering questions about videos, involves rich spatio-temporal content (e.g., appearance and motion) and requires a multi-hop reasoning process. However, existing methods usually process appearance and motion separately and fail to synchronize attention over appearance and motion features, neglecting two key properties of VideoQA: (1) appearance and motion features are usually concomitant and complementary to each other at the time-slice level, and some questions rely on joint representations of both kinds of features at a particular point in the video; (2) appearance and motion carry different importance at different steps of multi-step reasoning. In this paper, we propose a novel Question-Aware Tube-Switch Network (TSN) for video question answering, which contains (1) a Mix module that synchronously combines the appearance and motion representations at the time-slice level, achieving fine-grained temporal alignment and correspondence between appearance and motion at every time slice, and (2) a Switch module that adaptively chooses the appearance or motion tube as primary at each reasoning step, guiding the multi-hop reasoning process. To train TSN end to end, we adopt the Gumbel-Softmax strategy to handle the discrete tube-switch process. Extensive experimental results on two benchmarks, MSVD-QA and MSRVTT-QA, demonstrate that the proposed TSN consistently outperforms the state of the art on all metrics.
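The discrete tube-switch is the part of the design that the Gumbel-Softmax strategy exists to solve: committing to a single appearance or motion tube at each reasoning hop is a hard, non-differentiable choice, and reparameterizing it with Gumbel noise keeps the whole network trainable end to end. The PyTorch sketch below illustrates only that mechanism; the module name TubeSwitch, the shared linear scorer, and the per-hop tube summaries are hypothetical simplifications for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TubeSwitch(nn.Module):
    """One reasoning hop: pick the appearance or motion tube as primary.

    A hypothetical sketch of Gumbel-Softmax tube switching, not the
    paper's actual implementation.
    """

    def __init__(self, dim):
        super().__init__()
        # Scores a (question-state, tube-summary) pair; shared across tubes.
        self.scorer = nn.Linear(2 * dim, 1)

    def forward(self, q, app_tube, mot_tube, tau=1.0):
        # q:        (B, D) question/reasoning state at this hop
        # app_tube: (B, D) appearance-tube summary (assumed precomputed)
        # mot_tube: (B, D) motion-tube summary (assumed precomputed)
        s_app = self.scorer(torch.cat([q, app_tube], dim=-1))  # (B, 1)
        s_mot = self.scorer(torch.cat([q, mot_tube], dim=-1))  # (B, 1)
        logits = torch.cat([s_app, s_mot], dim=-1)             # (B, 2)

        # hard=True yields a one-hot choice in the forward pass while the
        # backward pass uses the soft sample's gradient (straight-through),
        # so the discrete switch stays differentiable end to end.
        gate = F.gumbel_softmax(logits, tau=tau, hard=True)    # (B, 2)

        # Select the primary tube for this hop, per sample in the batch.
        primary = gate[:, :1] * app_tube + gate[:, 1:] * mot_tube
        return primary, gate
```

At inference time the stochastic sample would typically be replaced by an argmax over the logits, and annealing the temperature tau toward zero during training sharpens the soft samples toward the hard one-hot choice.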




    Information

    Published In

    MM '19: Proceedings of the 27th ACM International Conference on Multimedia
    October 2019
    2794 pages
    ISBN:9781450368896
    DOI:10.1145/3343031


    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 15 October 2019


    Author Tags

    1. appearance and motion
    2. video question answering
    3. visual attention

    Qualifiers

    • Research-article

    Funding Sources

    • National Key Research and Development Program of China
    • Ministry of Education of P.R. China
    • National Natural Science Foundation of China

    Conference

    MM '19

    Acceptance Rates

    MM '19 paper acceptance rate: 252 of 936 submissions (27%)
    Overall acceptance rate: 2,145 of 8,556 submissions (25%)

    Article Metrics

    • Downloads (last 12 months): 26
    • Downloads (last 6 weeks): 2
    Reflects downloads up to 08 Mar 2025

    Cited By
    • (2025) Prompting Video-Language Foundation Models With Domain-Specific Fine-Grained Heuristics for Video Question Answering. IEEE Transactions on Circuits and Systems for Video Technology 35(2), 1615-1630. DOI: 10.1109/TCSVT.2024.3475510
    • (2024) Multi-Granularity Contrastive Cross-Modal Collaborative Generation for End-to-End Long-Term Video Question Answering. IEEE Transactions on Image Processing 33, 3115-3129. DOI: 10.1109/TIP.2024.3390984
    • (2023) LiVLR: A Lightweight Visual-Linguistic Reasoning Framework for Video Question Answering. IEEE Transactions on Multimedia 25, 5002-5013. DOI: 10.1109/TMM.2022.3185900
    • (2023) Video Question Answering Using Clip-Guided Visual-Text Attention. 2023 IEEE International Conference on Image Processing (ICIP), 81-85. DOI: 10.1109/ICIP49359.2023.10222286
    • (2023) ReGR. Information Processing and Management: an International Journal 60(4). DOI: 10.1016/j.ipm.2023.103375
    • (2022) DualVGR: A Dual-Visual Graph Reasoning Unit for Video Question Answering. IEEE Transactions on Multimedia 24, 3369-3380. DOI: 10.1109/TMM.2021.3097171
    • (2022) Unpaired Image Captioning With Semantic-Constrained Self-Learning. IEEE Transactions on Multimedia 24, 904-916. DOI: 10.1109/TMM.2021.3060948
    • (2022) Cross-Attentional Spatio-Temporal Semantic Graph Networks for Video Question Answering. IEEE Transactions on Image Processing 31, 1684-1696. DOI: 10.1109/TIP.2022.3142526
    • (2022) Hierarchical Representation Network With Auxiliary Tasks for Video Captioning and Video Question Answering. IEEE Transactions on Image Processing 31, 202-215. DOI: 10.1109/TIP.2021.3120867
    • (2022) Action-Centric Relation Transformer Network for Video Question Answering. IEEE Transactions on Circuits and Systems for Video Technology 32(1), 63-74. DOI: 10.1109/TCSVT.2020.3048440
