
Video Captioning Using Attention Based Visual Fusion with Bi-temporal Context and Bi-modal Semantic Feature Learning

  • Conference paper
Proceedings of the International Conference on Advanced Intelligent Systems and Informatics 2020 (AISI 2020)

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1261)


Abstract

Video captioning is a recently emerging task in which a video is described by generating a natural-language sentence. In practice, videos are untrimmed, so both localizing and describing the event of interest are crucial for many vision-based real-life applications. This paper proposes a deep neural network framework for effective video event localization that uses a bidirectional Long Short-Term Memory (LSTM) network to encode past, current, and future context information. Our framework adopts an encoder-decoder network that accepts, for captioning, the event proposal with the highest temporal intersection with the ground truth. The encoder is fed attentively fused visual features, extracted by a two-stream 3D convolutional neural network, along with the proposal's context information, to generate an effective representation. The decoder accepts learned semantic features that represent bi-modal (two-mode) high-level semantic concepts. Our experiments demonstrate that utilizing both semantic features and contextual information improves captioning performance.
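As a rough illustration of the components named in the abstract, the sketch below shows, in PyTorch-style Python, how one might (a) select the event proposal with the highest temporal IoU against the ground-truth segment, (b) attentively fuse RGB and optical-flow features from a two-stream 3D CNN, and (c) encode the fused sequence with a bidirectional LSTM that carries past and future context. All module names, dimensions, and the exact fusion form are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of proposal selection by temporal IoU, attention-based fusion
# of two-stream features, and bidirectional LSTM context encoding.
# Assumed names and dimensions; not the paper's exact architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F


def temporal_iou(proposal, ground_truth):
    """IoU between two temporal segments given as (start, end)."""
    inter = max(0.0, min(proposal[1], ground_truth[1]) - max(proposal[0], ground_truth[0]))
    union = (proposal[1] - proposal[0]) + (ground_truth[1] - ground_truth[0]) - inter
    return inter / union if union > 0 else 0.0


def best_proposal(proposals, ground_truth):
    """Pick the event proposal with the highest temporal IoU against ground truth."""
    return max(proposals, key=lambda p: temporal_iou(p, ground_truth))


class AttentiveFusionEncoder(nn.Module):
    """Fuses RGB and optical-flow clip features with learned attention weights,
    then encodes the fused sequence with a bidirectional LSTM so every time step
    carries past and future context."""

    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.attn = nn.Linear(2 * feat_dim, 2)      # one attention score per stream
        self.bilstm = nn.LSTM(feat_dim, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, rgb_feats, flow_feats):
        # rgb_feats, flow_feats: (batch, time, feat_dim) from the two 3D-CNN streams
        scores = self.attn(torch.cat([rgb_feats, flow_feats], dim=-1))  # (B, T, 2)
        weights = F.softmax(scores, dim=-1)
        fused = weights[..., 0:1] * rgb_feats + weights[..., 1:2] * flow_feats
        context, _ = self.bilstm(fused)             # (B, T, 2 * hidden_dim)
        return context


# Example usage with random tensors standing in for extracted clip features.
if __name__ == "__main__":
    proposals = [(0.0, 3.2), (2.5, 7.8), (6.0, 9.1)]
    print(best_proposal(proposals, ground_truth=(2.0, 8.0)))   # -> (2.5, 7.8)

    encoder = AttentiveFusionEncoder()
    rgb = torch.randn(1, 16, 2048)
    flow = torch.randn(1, 16, 2048)
    print(encoder(rgb, flow).shape)                 # torch.Size([1, 16, 1024])
```

The encoder output would then be consumed by a caption decoder conditioned on the bi-modal semantic features described above; that part is omitted here since the abstract does not specify its form.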



Author information


Correspondence to Noorhan K. Fawzy, Mohammed A. Marey, or Mostafa M. Aref.


Copyright information

© 2021 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Fawzy, N.K., Marey, M.A., Aref, M.M. (2021). Video Captioning Using Attention Based Visual Fusion with Bi-temporal Context and Bi-modal Semantic Feature Learning. In: Hassanien, A.E., Slowik, A., Snášel, V., El-Deeb, H., Tolba, F.M. (eds) Proceedings of the International Conference on Advanced Intelligent Systems and Informatics 2020. AISI 2020. Advances in Intelligent Systems and Computing, vol 1261. Springer, Cham. https://doi.org/10.1007/978-3-030-58669-0_6

