Abstract
This work addresses video question answering (VideoQA) with a novel model and a new open-ended VideoQA dataset. VideoQA is a challenging task in visual information retrieval that aims to generate an answer according to the video content and the question; ultimately, it is a video understanding task, and efficiently combining multi-grained representations is the key to understanding a video. Existing works mostly focus on overall frame-level visual understanding, which neglects the finer-grained and temporal information inside the video, or combine the multi-grained representations simply by concatenation or addition. We therefore propose a multi-granularity temporal attention network that can locate the specific frames in a video that are holistically and locally related to the answer. We first learn mutual attention representations of the multi-grained visual content and the question. The mutually attended features are then fused hierarchically by a two-layer LSTM to generate the answer. Furthermore, we evaluate several multi-grained fusion configurations to demonstrate the advantage of this hierarchical architecture. The effectiveness of our model is demonstrated on a large-scale video question answering dataset built on ActivityNet.
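To make the fusion pipeline concrete, the following is a minimal PyTorch sketch of the architecture described above. It is an illustration under stated assumptions, not the authors' implementation: the simplified one-way question-guided attention stands in for the paper's mutual attention, and all layer names, feature dimensions, and the answer-vocabulary size are hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalTemporalFusion(nn.Module):
    """Sketch of question-guided multi-grained attention with hierarchical
    temporal fusion. Dimensions and layer choices are illustrative, not the
    authors' exact configuration."""

    def __init__(self, dim=512, num_answers=1000):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)   # projects the question encoding
        self.v_proj = nn.Linear(dim, dim)   # projects the visual sequence
        self.score = nn.Linear(dim, 1)      # scalar attention score per time step
        # Two-layer LSTM that fuses the attended granularities over time.
        self.fusion_lstm = nn.LSTM(2 * dim, dim, num_layers=2, batch_first=True)
        self.classifier = nn.Linear(dim, num_answers)

    def attend(self, feats, q):
        # feats: (B, T, D) visual sequence; q: (B, D) question encoding.
        s = self.score(torch.tanh(self.v_proj(feats) + self.q_proj(q).unsqueeze(1)))
        w = F.softmax(s, dim=1)              # temporal attention weights (B, T, 1)
        return w * feats                     # reweighted sequence (B, T, D)

    def forward(self, frame_feats, region_feats, q):
        # frame_feats:  (B, T, D) frame-level (global) CNN features
        # region_feats: (B, T, D) pooled region-level (local) features per frame
        g = self.attend(frame_feats, q)      # question-attended global stream
        l = self.attend(region_feats, q)     # question-attended local stream
        # Hierarchical fusion: concatenate the two granularities per time step,
        # then integrate them temporally with the two-layer LSTM.
        fused, _ = self.fusion_lstm(torch.cat([g, l], dim=-1))
        return self.classifier(fused[:, -1]) # logits over the answer vocabulary

In this sketch, each granularity is first reweighted by question relevance, and only then does the two-layer LSTM integrate the streams across time, mirroring the attend-then-fuse ordering described in the abstract.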

Acknowledgements
This work was supported by the Zhejiang Natural Science Foundation (LR19F020002, LZ17F020001), the National Natural Science Foundation of China (61572431), the Key R&D Program of Zhejiang Province (2018C01006), the Chinese Knowledge Center for Engineering Sciences and Technology, and the Joint Research Program of ZJU and Hikvision Research Institute.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Xiao, S., Li, Y., Ye, Y. et al. Hierarchical Temporal Fusion of Multi-grained Attention Features for Video Question Answering. Neural Process Lett 52, 993–1003 (2020). https://doi.org/10.1007/s11063-019-10003-1