
Video Captioning Using Attention Based Visual Fusion with Bi-temporal Context and Bi-modal Semantic Feature Learning

  • Conference paper
Proceedings of the International Conference on Advanced Intelligent Systems and Informatics 2020 (AISI 2020)

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1261)


Abstract

Video captioning is a recently emerging task in which a video is described by generating a natural-language sentence. In practice, videos are untrimmed, so both localizing and describing the event of interest are crucial for many vision-based real-life applications. This paper proposes a deep neural network framework for effective video event localization that uses a bidirectional Long Short-Term Memory (LSTM) network to encode past, current, and future context information. Our framework adopts an encoder-decoder network that accepts, for captioning, the event proposal with the highest temporal intersection with the ground truth. The encoder is fed attentively fused visual features, extracted by a two-stream 3D convolutional neural network, along with the proposal's context information, to generate an effective representation. The decoder accepts learned semantic features that represent bi-modal (two-mode) high-level semantic concepts. Our experiments demonstrate that utilizing both semantic features and contextual information improves captioning performance.
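As a rough illustration of the components named in the abstract, the sketch below shows, in PyTorch-style Python, how one might (a) select the event proposal with the highest temporal IoU against the ground-truth segment, (b) attentively fuse RGB and optical-flow features from a two-stream 3D CNN, and (c) encode the fused sequence with a bidirectional LSTM that carries past and future context. All module names, dimensions, and the exact fusion form are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of proposal selection by temporal IoU, attention-based fusion
# of two-stream features, and bidirectional LSTM context encoding.
# Assumed names and dimensions; not the paper's exact architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F


def temporal_iou(proposal, ground_truth):
    """IoU between two temporal segments given as (start, end)."""
    inter = max(0.0, min(proposal[1], ground_truth[1]) - max(proposal[0], ground_truth[0]))
    union = (proposal[1] - proposal[0]) + (ground_truth[1] - ground_truth[0]) - inter
    return inter / union if union > 0 else 0.0


def best_proposal(proposals, ground_truth):
    """Pick the event proposal with the highest temporal IoU against ground truth."""
    return max(proposals, key=lambda p: temporal_iou(p, ground_truth))


class AttentiveFusionEncoder(nn.Module):
    """Fuses RGB and optical-flow clip features with learned attention weights,
    then encodes the fused sequence with a bidirectional LSTM so every time step
    carries past and future context."""

    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.attn = nn.Linear(2 * feat_dim, 2)      # one attention score per stream
        self.bilstm = nn.LSTM(feat_dim, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, rgb_feats, flow_feats):
        # rgb_feats, flow_feats: (batch, time, feat_dim) from the two 3D-CNN streams
        scores = self.attn(torch.cat([rgb_feats, flow_feats], dim=-1))  # (B, T, 2)
        weights = F.softmax(scores, dim=-1)
        fused = weights[..., 0:1] * rgb_feats + weights[..., 1:2] * flow_feats
        context, _ = self.bilstm(fused)             # (B, T, 2 * hidden_dim)
        return context


# Example usage with random tensors standing in for extracted clip features.
if __name__ == "__main__":
    proposals = [(0.0, 3.2), (2.5, 7.8), (6.0, 9.1)]
    print(best_proposal(proposals, ground_truth=(2.0, 8.0)))   # -> (2.5, 7.8)

    encoder = AttentiveFusionEncoder()
    rgb = torch.randn(1, 16, 2048)
    flow = torch.randn(1, 16, 2048)
    print(encoder(rgb, flow).shape)                 # torch.Size([1, 16, 1024])
```

The encoder output would then be consumed by a caption decoder conditioned on the bi-modal semantic features described above; that part is omitted here since the abstract does not specify its form.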



Author information


Correspondence to Noorhan K. Fawzy, Mohammed A. Marey, or Mostafa M. Aref.


Copyright information

© 2021 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Fawzy, N.K., Marey, M.A., Aref, M.M. (2021). Video Captioning Using Attention Based Visual Fusion with Bi-temporal Context and Bi-modal Semantic Feature Learning. In: Hassanien, A.E., Slowik, A., Snášel, V., El-Deeb, H., Tolba, F.M. (eds) Proceedings of the International Conference on Advanced Intelligent Systems and Informatics 2020. AISI 2020. Advances in Intelligent Systems and Computing, vol 1261. Springer, Cham. https://doi.org/10.1007/978-3-030-58669-0_6

