Initialized Frame Attention Networks for Video Question Answering

Gao, Kun; Zhu, Xianglei; Han, Yahong

doi:10.1007/978-981-10-8530-7_34

Part of the book series: Communications in Computer and Information Science (CCIS, volume 819)

Included in the following conference series:

International Conference on Internet Multimedia Computing and Service

Abstract

Video Question Answering (Video QA) is an important and challenging problem in multimedia and computer vision research. In this paper, we propose a novel framework called initialized frame attention networks (IFAN). The framework uses long short-term memory (LSTM) networks to encode the visual information of a video and then initializes the language model with the encoded features. The answer is then predicted from the combined visual and semantic features. In particular, IFAN integrates a temporal attention mechanism that focuses on the salient video frames associated with the question. To verify the effectiveness of the proposed framework, we conduct experiments on the TACoS dataset, where it achieves good performance on both the hard and easy levels.
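The abstract does not give the attention equations, so the following is only a minimal sketch of question-guided temporal attention in general, not the authors' implementation. It assumes a common additive scoring form: each frame feature is projected together with a question vector, squashed with tanh, scored, and the scores are normalized with a softmax over frames. The names `temporal_attention`, `W_v`, `W_q`, and `w`, along with all dimensions and the random inputs, are hypothetical and chosen for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def temporal_attention(frames, question, W_v, W_q, w):
    """Weight frames by their relevance to the question.

    frames:   T frame feature vectors, each of length d_v
    question: question feature vector of length d_q
    W_v, W_q: hypothetical projection matrices (d_h x d_v, d_h x d_q)
    w:        hypothetical scoring vector of length d_h
    Returns the attention weights (length T) and the attended
    video feature (length d_v).
    """
    q_proj = matvec(W_q, question)  # shared across all frames
    scores = []
    for frame in frames:
        v_proj = matvec(W_v, frame)
        hidden = [math.tanh(a + b) for a, b in zip(v_proj, q_proj)]
        scores.append(sum(wk * hk for wk, hk in zip(w, hidden)))
    weights = softmax(scores)  # one weight per frame, summing to 1
    dim = len(frames[0])
    context = [sum(a * f[i] for a, f in zip(weights, frames)) for i in range(dim)]
    return weights, context


# Toy inputs: 6 frames with 4-d features and a 3-d question vector.
random.seed(0)
T, d_v, d_q, d_h = 6, 4, 3, 5
frames = [[random.gauss(0, 1) for _ in range(d_v)] for _ in range(T)]
question = [random.gauss(0, 1) for _ in range(d_q)]
W_v = [[random.gauss(0, 0.5) for _ in range(d_v)] for _ in range(d_h)]
W_q = [[random.gauss(0, 0.5) for _ in range(d_q)] for _ in range(d_h)]
w = [random.gauss(0, 1) for _ in range(d_h)]

weights, context = temporal_attention(frames, question, W_v, W_q, w)
print("attention weights:", [round(a, 3) for a in weights])
print("attended feature length:", len(context))
```

In a full model the attended feature would be combined with the question representation to predict the answer, and the projection parameters would be learned rather than sampled at random.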

Acknowledgment

This work was supported by the NSFC (under Grants U1509206 and 61472276).

Author information

Authors and Affiliations

School of Computer Science and Technology, Tianjin University, Tianjin, China
Kun Gao & Yahong Han
China Automotive Technology and Research Center, Tianjin, China
Xianglei Zhu

Corresponding author

Correspondence to Kun Gao.

Editor information

Editors and Affiliations

Multimedia Communications Department, EURECOM, Sophia Antipolis, France
Benoit Huet
Shandong University, Qingdao, China
Liqiang Nie
Hefei University of Technology, Hefei, China
Richang Hong

About this paper

Cite this paper

Gao, K., Zhu, X., Han, Y. (2018). Initialized Frame Attention Networks for Video Question Answering. In: Huet, B., Nie, L., Hong, R. (eds) Internet Multimedia Computing and Service. ICIMCS 2017. Communications in Computer and Information Science, vol 819. Springer, Singapore. https://doi.org/10.1007/978-981-10-8530-7_34

DOI: https://doi.org/10.1007/978-981-10-8530-7_34
Published: 01 March 2018
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-8529-1
Online ISBN: 978-981-10-8530-7
eBook Packages: Computer Science, Computer Science (R0)
