Video Question Answering: a Survey of Models and Datasets

Mobile Networks and Applications

Abstract

Video question answering (VideoQA) automatically answers natural language questions according to the content of videos. It promotes the development of applications such as online education, scenario analysis, and video content retrieval. VideoQA is a challenging task because it requires a model to understand the semantic information of both the video and the question in order to generate the answer. First, we propose a general framework for VideoQA consisting of a video feature extraction module, a text feature extraction module, an integration module, and an answer generation module. The integration module is the core module; it comprises a core processing model, recurrent neural network (RNN) encoders, and feature fusion. These three sub-modules cooperate to generate the contextual representation, from which the answer generation module produces the answer. Then, we summarize the methods used in the core processing model and introduce their ideas and applications in detail, including encoder-decoder models, attention models, memory networks, and other methods. Additionally, we introduce the widely used datasets and evaluation criteria, and analyze experimental results on benchmark datasets. Finally, we discuss challenges in the field of VideoQA and suggest possible directions for future work.
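
To make this framework concrete, the sketch below shows a minimal PyTorch rendering of the four-module pipeline. It is an illustrative assumption, not any specific surveyed model: pre-extracted frame features and question word embeddings stand in for the video and text feature extraction modules; GRU encoders, question-guided temporal attention, and concatenation-based feature fusion form the integration module; and a classifier over a fixed answer vocabulary plays the role of the answer generation module. All module names and dimensions are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoQAFramework(nn.Module):
    """Minimal sketch of the general VideoQA framework (illustrative only).

    Assumes video features are pre-extracted (e.g., per-frame CNN features)
    and the question is given as word embeddings.
    """

    def __init__(self, video_dim=2048, word_dim=300, hidden_dim=512, num_answers=1000):
        super().__init__()
        # Feature extraction modules: linear projections stand in for the
        # CNN / word-embedding stages that produce the raw features.
        self.video_proj = nn.Linear(video_dim, hidden_dim)
        self.text_proj = nn.Linear(word_dim, hidden_dim)
        # Integration module: RNN encoders for both modalities.
        self.video_rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.text_rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        # Answer generation module: classification over a fixed answer set,
        # the common open-ended / multiple-choice setting.
        self.classifier = nn.Linear(2 * hidden_dim, num_answers)

    def forward(self, frame_feats, word_embs):
        # frame_feats: (batch, num_frames, video_dim)
        # word_embs:   (batch, num_words, word_dim)
        v, _ = self.video_rnn(self.video_proj(frame_feats))   # (B, T, H)
        _, q = self.text_rnn(self.text_proj(word_embs))       # (1, B, H)
        q = q.squeeze(0)                                      # question summary (B, H)
        # Question-guided temporal attention over frames.
        scores = torch.bmm(v, q.unsqueeze(2)).squeeze(2)      # (B, T)
        alpha = F.softmax(scores, dim=1)
        v_ctx = torch.bmm(alpha.unsqueeze(1), v).squeeze(1)   # attended video context (B, H)
        # Feature fusion -> contextual representation -> answer scores.
        fused = torch.cat([v_ctx, q], dim=1)
        return self.classifier(fused)

# Usage with random stand-in features:
model = VideoQAFramework()
logits = model(torch.randn(2, 20, 2048), torch.randn(2, 12, 300))
print(logits.shape)  # torch.Size([2, 1000])
```

In this sketch, the attention step plays the role of the core processing model; in the survey's taxonomy it could equally be replaced by an encoder-decoder, a memory network, or another integration method.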




Acknowledgements

This research was supported in part by the National Natural Science Foundation of China (No. 61702140), the Scientific Research Foundation for the Overseas Returning Person of Heilongjiang Province of China (LC2018030), and the Fundamental Research Foundation for Universities of Heilongjiang Province (JMRH2018XM04).

Author information

Corresponding author

Correspondence to Guanglu Sun.

About this article

Cite this article

Sun, G., Liang, L., Li, T. et al. Video Question Answering: a Survey of Models and Datasets. Mobile Netw Appl 26, 1904–1937 (2021). https://doi.org/10.1007/s11036-020-01730-0
