Locating Visual Explanations for Video Question Answering

  • Conference paper
  • MultiMedia Modeling (MMM 2021)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12572)

Abstract

Although promising performance has been reported for Video Question Answering (VideoQA) in recent years, there is still a large gap between model decisions and genuine human understanding of them. Moreover, beyond a short answer, complementary visual information is desirable to enhance and elucidate the content of QA pairs. To this end, we introduce a new task called Video Question Answering with Visual Explanations (VQA-VE), which requires models to generate answers and provide visual explanations (i.e., locate relevant moments within the whole video) simultaneously. This task bridges video question answering and temporal localization, two typical visual tasks that are usually studied separately, and thus poses new challenges. For training and evaluation, we build a new dataset on top of ActivityNet Captions by annotating QA pairs with temporal ground truth, and we also adopt the large-scale TVQA benchmark. For VQA-VE, we develop a new model that generates complete natural language sentences as answers while locating relevant moments with various time spans in a multi-task framework. We also introduce two metrics to fairly measure performance on VQA-VE. Experimental results not only show the effectiveness of our model, but also demonstrate that additional supervision from visual explanations can improve the performance of models on the traditional VideoQA task.
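
The abstract mentions two evaluation metrics for VQA-VE without detailing them, so the following is only a minimal Python sketch of one plausible form such a joint metric could take: an answer-quality score (e.g., METEOR) gated by the temporal IoU (tIoU) between the predicted and ground-truth moments. The function names, the 0.5 threshold, and the gating scheme are illustrative assumptions, not the paper's actual metrics.

    # Hypothetical sketch of a joint VQA-VE score: the gating scheme and
    # threshold below are assumptions, not the metrics proposed in the paper.

    def temporal_iou(pred_span, gt_span):
        """tIoU between two (start, end) moments, in seconds."""
        inter = max(0.0, min(pred_span[1], gt_span[1]) - max(pred_span[0], gt_span[0]))
        union = (pred_span[1] - pred_span[0]) + (gt_span[1] - gt_span[0]) - inter
        return inter / union if union > 0 else 0.0

    def joint_score(answer_score, pred_span, gt_span, tiou_threshold=0.5):
        """Credit the generated answer only when the located moment overlaps
        the ground-truth moment above a tIoU threshold."""
        return answer_score if temporal_iou(pred_span, gt_span) >= tiou_threshold else 0.0

    # Example: an answer score of 0.8 with tIoU((12,20), (10,18)) = 0.6 >= 0.5.
    print(joint_score(0.8, pred_span=(12.0, 20.0), gt_span=(10.0, 18.0)))  # -> 0.8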

References

  1. Antol, S., et al.: VQA: visual question answering. In: ICCV (2015)

  2. Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: EMNLP (2014)

  3. Denkowski, M., Lavie, A.: Meteor universal: language specific translation evaluation for any target language. In: Proceedings of the Ninth Workshop on Statistical Machine Translation, pp. 376–380 (2014)

  4. Fukui, A., Park, D.H., Yang, D., Rohrbach, A., Darrell, T., Rohrbach, M.: Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847 (2016)

  5. Gao, J., Ge, R., Chen, K., Nevatia, R.: Motion-appearance co-memory networks for video question answering. In: CVPR (2018)

  6. Gao, J., Sun, C., Yang, Z., Nevatia, R.: TALL: temporal activity localization via language query. In: ICCV (2017)

  7. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)

  8. Heilbron, F.C., Escorcia, V., Ghanem, B., Niebles, J.C.: ActivityNet: a large-scale video benchmark for human activity understanding. In: CVPR (2015)

  9. Krishna, R., Hata, K., Ren, F., Fei-Fei, L., Niebles, J.C.: Dense-captioning events in videos. In: ICCV (2017)

  10. Lei, J., Yu, L., Bansal, M., Berg, T.L.: TVQA: localized, compositional video question answering. In: EMNLP (2018)

  11. Liang, J., Jiang, L., Cao, L., Kalantidis, Y., Li, L.J., Hauptmann, A.G.: Focal visual-text attention for memex question answering. IEEE Trans. Pattern Anal. Mach. Intell. 41(8), 1893–1908 (2019)

  12. Lin, T., Zhao, X., Su, H., Wang, C., Yang, M.: BSN: boundary sensitive network for temporal action proposal generation. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11208, pp. 3–21. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01225-0_1

  13. Maharaj, T., Ballas, N., Rohrbach, A., Courville, A., Pal, C.: A dataset and exploration of models for understanding video data through fill-in-the-blank question-answering. In: CVPR, pp. 7359–7368 (2017)

  14. Malinowski, M., Fritz, M.: A multi-world approach to question answering about real-world scenes based on uncertain input. In: NIPS, pp. 1682–1690 (2014)

  15. Park, D.H., et al.: Multimodal explanations: justifying decisions and pointing to the evidence. In: CVPR (2018)

  16. Song, X., Shi, Y., Chen, X., Han, Y.: Explore multi-step reasoning for video question answering. In: ACM MM (2018)

  17. Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3D convolutional networks. In: ICCV (2015)

  18. Venugopalan, S., Rohrbach, M., Donahue, J., Mooney, R., Darrell, T., Saenko, K.: Sequence to sequence - video to text. In: ICCV (2015)

  19. Xu, K., et al.: Show, attend and tell: neural image caption generation with visual attention. In: ICML (2015)

  20. Xue, H., Zhao, Z., Cai, D.: Unifying the video and question attentions for open-ended video question answering. IEEE TIP 26, 5656–5666 (2017)

  21. Yu, Y., Ko, H., Choi, J., Kim, G.: End-to-end concept word detection for video captioning, retrieval, and question answering. In: CVPR (2017)

  22. Yu, Z., et al.: ActivityNet-QA: a dataset for understanding complex web videos via question answering. arXiv preprint arXiv:1906.02467 (2019)

  23. Zeng, K., Chen, T., Chuang, C., Liao, Y., Niebles, J.C., Sun, M.: Leveraging video descriptions to learn video question answering. In: AAAI (2017)

  24. Zhao, Y., Xiong, Y., Wang, L., Wu, Z., Tang, X., Lin, D.: Temporal action detection with structured segment networks. In: ICCV (2017)

  25. Zhao, Z., et al.: Open-ended long-form video question answering via adaptive hierarchical reinforced networks. In: IJCAI, pp. 3683–3689 (2018)

Acknowledgement

This work is supported by the NSFC (under Grants 61876130 and 61932009).

Author information

Corresponding author

Correspondence to Yahong Han.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Chen, X., Liu, R., Song, X., Han, Y. (2021). Locating Visual Explanations for Video Question Answering. In: Lokoč, J., et al. MultiMedia Modeling. MMM 2021. Lecture Notes in Computer Science, vol 12572. Springer, Cham. https://doi.org/10.1007/978-3-030-67832-6_24

  • DOI: https://doi.org/10.1007/978-3-030-67832-6_24

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-67831-9

  • Online ISBN: 978-3-030-67832-6

  • eBook Packages: Computer Science, Computer Science (R0)
