Video Question Answering: a Survey of Models and Datasets

Mobile Networks and Applications

Abstract

Video question answering (VideoQA) automatically answers natural language questions according to the content of videos. It promotes the development of applications such as online education, scenario analysis, and video content retrieval. VideoQA is a challenging task because it requires a model to understand the semantic information of both the video and the question in order to generate the answer. First, we propose a general framework for VideoQA consisting of a video feature extraction module, a text feature extraction module, an integration module, and an answer generation module. The integration module is the core module; it comprises a core processing model, recurrent neural network (RNN) encoders, and feature fusion. These three sub-modules cooperate to generate the contextual representation, from which the answer generation module produces the answer. Then, we summarize the methods used in the core processing model and introduce their ideas and applications in detail, including encoder-decoder models, attention models, memory networks, and other methods. Additionally, we introduce the widely used datasets and evaluation criteria, and analyze experimental results on benchmark datasets. Finally, we discuss challenges in the field of VideoQA and suggest possible directions for future work.
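
To make this framework concrete, the sketch below shows a minimal PyTorch rendering of the four-module pipeline. It is an illustrative assumption, not any specific surveyed model: pre-extracted frame features and question word embeddings stand in for the video and text feature extraction modules; GRU encoders, question-guided temporal attention, and concatenation-based feature fusion form the integration module; and a classifier over a fixed answer vocabulary plays the role of the answer generation module. All module names and dimensions are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoQAFramework(nn.Module):
    """Minimal sketch of the general VideoQA framework (illustrative only).

    Assumes video features are pre-extracted (e.g., per-frame CNN features)
    and the question is given as word embeddings.
    """

    def __init__(self, video_dim=2048, word_dim=300, hidden_dim=512, num_answers=1000):
        super().__init__()
        # Feature extraction modules: linear projections stand in for the
        # CNN / word-embedding stages that produce the raw features.
        self.video_proj = nn.Linear(video_dim, hidden_dim)
        self.text_proj = nn.Linear(word_dim, hidden_dim)
        # Integration module: RNN encoders for both modalities.
        self.video_rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.text_rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        # Answer generation module: classification over a fixed answer set,
        # the common open-ended / multiple-choice setting.
        self.classifier = nn.Linear(2 * hidden_dim, num_answers)

    def forward(self, frame_feats, word_embs):
        # frame_feats: (batch, num_frames, video_dim)
        # word_embs:   (batch, num_words, word_dim)
        v, _ = self.video_rnn(self.video_proj(frame_feats))   # (B, T, H)
        _, q = self.text_rnn(self.text_proj(word_embs))       # (1, B, H)
        q = q.squeeze(0)                                      # question summary (B, H)
        # Question-guided temporal attention over frames.
        scores = torch.bmm(v, q.unsqueeze(2)).squeeze(2)      # (B, T)
        alpha = F.softmax(scores, dim=1)
        v_ctx = torch.bmm(alpha.unsqueeze(1), v).squeeze(1)   # attended video context (B, H)
        # Feature fusion -> contextual representation -> answer scores.
        fused = torch.cat([v_ctx, q], dim=1)
        return self.classifier(fused)

# Usage with random stand-in features:
model = VideoQAFramework()
logits = model(torch.randn(2, 20, 2048), torch.randn(2, 12, 300))
print(logits.shape)  # torch.Size([2, 1000])
```

In this sketch, the attention step plays the role of the core processing model; in the survey's taxonomy it could equally be replaced by an encoder-decoder, a memory network, or another integration method.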




Acknowledgements

This research was supported in part by the National Natural Science Foundation of China (No. 61702140), the Scientific Research Foundation for the Overseas Returning Person of Heilongjiang Province of China (LC2018030), and the Fundamental Research Foundation for Universities of Heilongjiang Province (JMRH2018XM04).

Author information

Corresponding author

Correspondence to Guanglu Sun.

About this article

Cite this article

Sun, G., Liang, L., Li, T. et al. Video Question Answering: a Survey of Models and Datasets. Mobile Netw Appl 26, 1904–1937 (2021). https://doi.org/10.1007/s11036-020-01730-0
