Abstract
Video captioning is an emerging task in which a video is described by generating a natural language sentence. In practice, videos are untrimmed, so both localizing and describing the event of interest are crucial for many vision-based real-life applications. This paper proposes a deep neural network framework for effective video event localization that uses a bidirectional Long Short-Term Memory (LSTM) network to encode past, current, and future context information. Our framework adopts an encoder-decoder network that captions the event proposal with the highest temporal intersection with the ground truth. The encoder is fed with attentively fused visual features, extracted by a two-stream 3D convolutional neural network, along with the proposal's context information to generate an effective representation. The decoder accepts learned semantic features that represent bi-modal (two-mode) high-level semantic concepts. Our experiments demonstrate that utilizing both semantic features and contextual information improves captioning performance.
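To make the described pipeline concrete, the following is a minimal illustrative sketch (not the authors' implementation) of an encoder-decoder captioner of this kind in PyTorch: soft-attention fusion of appearance and motion clip features, a bidirectional LSTM encoder for past/future context, and an LSTM decoder conditioned on a learned semantic-concept vector. All dimensions, names (FEAT_DIM, HID, VOCAB, SEM_DIM, CaptioningSketch), and layer choices are assumptions, since the full architecture is specified in the paper body rather than the abstract.

# Illustrative sketch only (assumed sizes and names, not the authors' code).
import torch
import torch.nn as nn

FEAT_DIM, HID, VOCAB, SEM_DIM = 2048, 512, 10000, 300


class AttentiveFusion(nn.Module):
    """Soft attention over the appearance and motion streams per time step."""
    def __init__(self, feat_dim):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, rgb, flow):                       # each: (B, T, F)
        streams = torch.stack([rgb, flow], dim=2)       # (B, T, 2, F)
        w = torch.softmax(self.score(streams), dim=2)   # (B, T, 2, 1)
        return (w * streams).sum(dim=2)                 # (B, T, F)


class CaptioningSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.fuse = AttentiveFusion(FEAT_DIM)
        self.encoder = nn.LSTM(FEAT_DIM, HID, batch_first=True,
                               bidirectional=True)      # past + future context
        self.embed = nn.Embedding(VOCAB, HID)
        self.decoder = nn.LSTM(HID + SEM_DIM, 2 * HID, batch_first=True)
        self.out = nn.Linear(2 * HID, VOCAB)

    def forward(self, rgb, flow, semantics, captions):
        fused = self.fuse(rgb, flow)                    # (B, T, F)
        _, (h, c) = self.encoder(fused)
        # Merge forward/backward final states to initialize the decoder.
        h0 = torch.cat([h[0], h[1]], dim=-1).unsqueeze(0)
        c0 = torch.cat([c[0], c[1]], dim=-1).unsqueeze(0)
        words = self.embed(captions)                    # (B, L, HID)
        sem = semantics.unsqueeze(1).expand(-1, words.size(1), -1)
        dec, _ = self.decoder(torch.cat([words, sem], dim=-1), (h0, c0))
        return self.out(dec)                            # (B, L, VOCAB)


# Example forward pass with random tensors standing in for clip features,
# semantic-concept scores, and tokenized caption words.
model = CaptioningSketch()
logits = model(torch.randn(2, 16, FEAT_DIM), torch.randn(2, 16, FEAT_DIM),
               torch.randn(2, SEM_DIM), torch.randint(0, VOCAB, (2, 12)))

Concatenating the forward and backward final states is one simple way to hand bidirectional context to a unidirectional decoder; attending over the encoder outputs at each decoding step would be a natural extension of this sketch.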
Copyright information
© 2021 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Fawzy, N.K., Marey, M.A., Aref, M.M. (2021). Video Captioning Using Attention Based Visual Fusion with Bi-temporal Context and Bi-modal Semantic Feature Learning. In: Hassanien, A.E., Slowik, A., Snášel, V., El-Deeb, H., Tolba, F.M. (eds) Proceedings of the International Conference on Advanced Intelligent Systems and Informatics 2020. AISI 2020. Advances in Intelligent Systems and Computing, vol 1261. Springer, Cham. https://doi.org/10.1007/978-3-030-58669-0_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58668-3
Online ISBN: 978-3-030-58669-0
eBook Packages: Intelligent Technologies and Robotics (R0)