
Appearance-Motion Dual-Stream Heterogeneous Network for VideoQA

Conference paper
MultiMedia Modeling (MMM 2024)
Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14556)


Abstract

Capturing question-relevant spatio-temporal information in videos remains a key challenge in the video question answering (VideoQA) task. Although great progress has been made in VideoQA, most existing methods do not sufficiently consider the correlation among appearance, motion, and object features, making it difficult to fully exploit spatio-temporal relationships at different granularities. Moreover, recent studies typically apply the same interaction method when each visual feature interacts with the question features, ignoring the distinct spatio-temporal characteristics of appearance and motion features and thus causing a spatio-temporal mismatch. In this paper, we propose an Appearance-Motion Dual-stream Heterogeneous Network for VideoQA (AMHN), which attends to the synergy among the three feature types through heterogeneous interactions tailored to their spatio-temporal characteristics. AMHN unites object features with appearance features and motion features, respectively, to obtain two high-level visual representations containing object information. These are then fed into an object-relational reasoning module to acquire relation-aware visual features. We use a bilinear attention network for appearance and propose a Video-Text Symmetric Attention Network (VTSAN) for motion to obtain diverse features, which are fused under the guidance of the question to predict the final answer. We evaluate AMHN on two VideoQA benchmark datasets and perform an extensive ablation study; the experimental results demonstrate its state-of-the-art performance.
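To make the heterogeneous interaction scheme concrete, the following is a minimal PyTorch sketch, not the authors' implementation: the module names, the shared hidden size across streams, and the mean pooling are all assumptions for illustration. It shows the three ingredients the abstract names: a low-rank bilinear attention for the appearance stream, a symmetric video-text attention (one plausible reading of VTSAN) for the motion stream, and a question-guided gate fusing the two pooled outputs.

```python
import torch
import torch.nn as nn


class BilinearAttention(nn.Module):
    """Low-rank bilinear attention between appearance and question features,
    in the spirit of Kim et al.'s Bilinear Attention Networks (hypothetical)."""

    def __init__(self, v_dim: int, q_dim: int, h_dim: int):
        super().__init__()
        self.v_proj = nn.Linear(v_dim, h_dim)
        self.q_proj = nn.Linear(q_dim, h_dim)

    def forward(self, v: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        # v: (B, Nv, v_dim) appearance features; q: (B, Nq, q_dim) question tokens
        vh = torch.tanh(self.v_proj(v))                       # (B, Nv, H)
        qh = torch.tanh(self.q_proj(q))                       # (B, Nq, H)
        att = torch.softmax(vh @ qh.transpose(1, 2), dim=1)   # (B, Nv, Nq)
        # Question-aligned visual summary, pooled over question tokens.
        return ((att.transpose(1, 2) @ vh) * qh).mean(dim=1)  # (B, H)


class SymmetricAttention(nn.Module):
    """Symmetric video-text cross-attention over a shared affinity matrix,
    a plausible sketch of the VTSAN idea (assumed, not the paper's code)."""

    def __init__(self, dim: int):
        super().__init__()
        self.scale = dim ** -0.5

    def forward(self, m: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        # m: (B, Nm, D) motion features; q: (B, Nq, D) question tokens
        aff = (m @ q.transpose(1, 2)) * self.scale            # (B, Nm, Nq)
        m2q = torch.softmax(aff, dim=2) @ q                   # motion attends to text
        q2m = torch.softmax(aff.transpose(1, 2), dim=2) @ m   # text attends to motion
        return (m + m2q).mean(dim=1) + (q + q2m).mean(dim=1)  # (B, D)


class QuestionGuidedFusion(nn.Module):
    """Fuse the two stream outputs with weights predicted from a global
    question vector (hypothetical gating choice)."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, 2)

    def forward(self, app: torch.Tensor, mot: torch.Tensor,
                q_global: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.gate(q_global), dim=-1)        # (B, 2)
        return w[:, :1] * app + w[:, 1:] * mot                # (B, D)
```

A complete model would also include the object-feature union and the object-relational reasoning module described above, plus an answer decoder; the point of the sketch is only that the two streams use structurally different attention before a question-conditioned fusion.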



Author information

Correspondence to Zheng Zhong.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Xu, F., Zhong, Z., Zhu, Y., Zhou, Y., Li, G. (2024). Appearance-Motion Dual-Stream Heterogeneous Network for VideoQA. In: Rudinac, S., et al. MultiMedia Modeling. MMM 2024. Lecture Notes in Computer Science, vol 14556. Springer, Cham. https://doi.org/10.1007/978-3-031-53311-2_16

  • DOI: https://doi.org/10.1007/978-3-031-53311-2_16
  • Publisher Name: Springer, Cham
  • Print ISBN: 978-3-031-53310-5
  • Online ISBN: 978-3-031-53311-2
