
Appearance-Motion Dual-Stream Heterogeneous Network for VideoQA

Conference paper
MultiMedia Modeling (MMM 2024)
Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14556)


Abstract

Capturing question-relevant spatio-temporal information in videos remains a key challenge in the video question answering (VideoQA) task. Although great progress has been made in VideoQA, most existing methods do not sufficiently consider the correlation among appearance, motion, and object features, making it difficult to fully exploit spatio-temporal relationships at different granularities. Moreover, recent studies typically apply the same interaction method when each visual feature interacts with the question features, ignoring the distinct spatio-temporal characteristics of appearance and motion features and thus causing a spatio-temporal mismatch. In this paper, we propose an Appearance-Motion Dual-stream Heterogeneous Network for VideoQA (AMHN), which attends to the synergy among the three feature types through heterogeneous interactions tailored to their spatio-temporal characteristics. AMHN unites object features with appearance features and motion features, respectively, to obtain two high-level visual representations containing object information. These are then fed into an object-relational reasoning module to acquire relation-aware visual features. We use a bilinear attention network for appearance and propose a Video-Text Symmetric Attention Network (VTSAN) for motion to obtain diverse features, which are fused under the guidance of the question to predict the final answer. We evaluate AMHN on two VideoQA benchmark datasets and perform an extensive ablation study; the experimental results demonstrate its state-of-the-art performance.
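To make the heterogeneous interaction scheme concrete, the following is a minimal PyTorch sketch, not the authors' implementation: the module names, the shared hidden size across streams, and the mean pooling are all assumptions for illustration. It shows the three ingredients the abstract names: a low-rank bilinear attention for the appearance stream, a symmetric video-text attention (one plausible reading of VTSAN) for the motion stream, and a question-guided gate fusing the two pooled outputs.

```python
import torch
import torch.nn as nn


class BilinearAttention(nn.Module):
    """Low-rank bilinear attention between appearance and question features,
    in the spirit of Kim et al.'s Bilinear Attention Networks (hypothetical)."""

    def __init__(self, v_dim: int, q_dim: int, h_dim: int):
        super().__init__()
        self.v_proj = nn.Linear(v_dim, h_dim)
        self.q_proj = nn.Linear(q_dim, h_dim)

    def forward(self, v: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        # v: (B, Nv, v_dim) appearance features; q: (B, Nq, q_dim) question tokens
        vh = torch.tanh(self.v_proj(v))                       # (B, Nv, H)
        qh = torch.tanh(self.q_proj(q))                       # (B, Nq, H)
        att = torch.softmax(vh @ qh.transpose(1, 2), dim=1)   # (B, Nv, Nq)
        # Question-aligned visual summary, pooled over question tokens.
        return ((att.transpose(1, 2) @ vh) * qh).mean(dim=1)  # (B, H)


class SymmetricAttention(nn.Module):
    """Symmetric video-text cross-attention over a shared affinity matrix,
    a plausible sketch of the VTSAN idea (assumed, not the paper's code)."""

    def __init__(self, dim: int):
        super().__init__()
        self.scale = dim ** -0.5

    def forward(self, m: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        # m: (B, Nm, D) motion features; q: (B, Nq, D) question tokens
        aff = (m @ q.transpose(1, 2)) * self.scale            # (B, Nm, Nq)
        m2q = torch.softmax(aff, dim=2) @ q                   # motion attends to text
        q2m = torch.softmax(aff.transpose(1, 2), dim=2) @ m   # text attends to motion
        return (m + m2q).mean(dim=1) + (q + q2m).mean(dim=1)  # (B, D)


class QuestionGuidedFusion(nn.Module):
    """Fuse the two stream outputs with weights predicted from a global
    question vector (hypothetical gating choice)."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, 2)

    def forward(self, app: torch.Tensor, mot: torch.Tensor,
                q_global: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.gate(q_global), dim=-1)        # (B, 2)
        return w[:, :1] * app + w[:, 1:] * mot                # (B, D)
```

A complete model would also include the object-feature union and the object-relational reasoning module described above, plus an answer decoder; the point of the sketch is only that the two streams use structurally different attention before a question-conditioned fusion.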



Author information

Correspondence to Zheng Zhong.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Xu, F., Zhong, Z., Zhu, Y., Zhou, Y., Li, G. (2024). Appearance-Motion Dual-Stream Heterogeneous Network for VideoQA. In: Rudinac, S., et al. MultiMedia Modeling. MMM 2024. Lecture Notes in Computer Science, vol 14556. Springer, Cham. https://doi.org/10.1007/978-3-031-53311-2_16

  • DOI: https://doi.org/10.1007/978-3-031-53311-2_16
  • Publisher Name: Springer, Cham
  • Print ISBN: 978-3-031-53310-5
  • Online ISBN: 978-3-031-53311-2
