Hierarchical Temporal Fusion of Multi-grained Attention Features for Video Question Answering

Neural Processing Letters

Abstract

This work addresses video question answering (VideoQA) with a novel model and a new open-ended VideoQA dataset. VideoQA is a challenging visual information retrieval task that aims to generate an answer from the video content and the question; at its core it is a video understanding problem, and efficiently combining multi-grained representations is the key to understanding a video. Existing works mostly rely on overall frame-level visual understanding, which neglects the finer-grained and temporal information inside the video, or combine the multi-grained representations simply by concatenation or addition. We therefore propose a multi-granularity temporal attention network that can locate the specific frames in a video that are holistically and locally related to the answer. We first learn mutual attention representations between the multi-grained visual content and the question; the mutually attended features are then combined hierarchically by a two-layer LSTM to generate the answer. We further compare several multi-grained fusion configurations to demonstrate the advantage of this hierarchical architecture. The effectiveness of our model is demonstrated on a large-scale open-ended VideoQA dataset built on ActivityNet.
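The abstract only sketches the architecture, so the following is a minimal, hypothetical PyTorch illustration of the two mechanisms it names: mutual attention between multi-grained visual features and the question, and hierarchical fusion of the attended streams through a two-layer LSTM. All module names, the co-attention scoring function, the additive combination of the two streams, and the fixed answer vocabulary are assumptions made for this sketch, not the authors' implementation.

```python
# A rough sketch, assuming generic co-attention and a stacked two-layer
# LSTM fusion; the paper's exact formulation is not reproduced here.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MutualAttention(nn.Module):
    """Co-attention between a visual sequence and the question words."""
    def __init__(self, dim):
        super().__init__()
        self.proj_v = nn.Linear(dim, dim)
        self.proj_q = nn.Linear(dim, dim)

    def forward(self, visual, question):
        # visual: (T, dim) frame or region features; question: (L, dim) word features
        affinity = torch.tanh(self.proj_v(visual)) @ torch.tanh(self.proj_q(question)).t()
        att_v = F.softmax(affinity.max(dim=1).values, dim=0)  # weight per frame
        att_q = F.softmax(affinity.max(dim=0).values, dim=0)  # weight per word
        v_att = att_v.unsqueeze(1) * visual                   # question-attended visual sequence
        q_att = (att_q.unsqueeze(1) * question).sum(dim=0)    # video-attended question vector
        return v_att, q_att

class HierarchicalFusion(nn.Module):
    """Fuses the finer-grained stream and the frame-level stream with a
    two-layer LSTM, then scores answers over a fixed vocabulary."""
    def __init__(self, dim, vocab_size):
        super().__init__()
        self.lstm_fine = nn.LSTM(dim, dim, batch_first=True)
        self.lstm_coarse = nn.LSTM(dim, dim, batch_first=True)
        self.classifier = nn.Linear(dim, vocab_size)

    def forward(self, region_seq, frame_seq, q_att):
        # region_seq, frame_seq: (1, T, dim) attended sequences; region features
        # are assumed pooled per frame so both streams share length T.
        fine_out, _ = self.lstm_fine(region_seq)
        _, (h, _) = self.lstm_coarse(fine_out + frame_seq)  # one plausible combination
        fused = h[-1].squeeze(0) * q_att   # condition the video summary on the question
        return self.classifier(fused)      # logits over candidate answers
```

The additive merge of the two LSTM layers' streams is only one plausible way to realize the hierarchical combination described in the abstract; concatenation followed by a projection would be an equally reasonable reading.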



Acknowledgements

This work was supported by Zhejiang Natural Science Foundation (LR19F020002, LZ17F020001), National Natural Science Foundation of China (61572431), Key R&D Program of Zhejiang Province (2018C01006), Chinese Knowledge Center for Engineering Sciences and Technology and Joint Research Program of ZJU and Hikvision Research Institute.

Author information

Corresponding author

Correspondence to Jun Xiao.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Xiao, S., Li, Y., Ye, Y. et al. Hierarchical Temporal Fusion of Multi-grained Attention Features for Video Question Answering. Neural Process Lett 52, 993–1003 (2020). https://doi.org/10.1007/s11063-019-10003-1

