Abstract
When people watch a video and attempt to describe its contents in natural language, they instinctively alternate between visual reasoning and reconsidering. Inspired by this, we propose a novel network that couples fully dynamic context guided reasoning with reconsidering. Specifically, an elaborate reconsidering module, referred to as the reconsiderator, rethinks and sharpens the preliminary results of stepwise reasoning from coarse to fine, thereby generating a higher-quality description. In turn, the context information summarized during reconsidering guides the network and further boosts its reasoning capability. Extensive experiments on two public benchmarks demonstrate that our approach is highly competitive with state-of-the-art methods.
This work was supported by the Natural Science Foundation of Tianjin (No. 20JCQNJC00720) and the Fundamental Research Funds for the Central Universities, CAUC (No. 3122021052).
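To make the alternating reason-reconsider scheme concrete, below is a minimal PyTorch sketch of a two-pass decoder: a first pass drafts a caption by stepwise reasoning over video features, and a second pass rethinks the draft and summarizes context that could be fed back to guide further reasoning. Every detail here (the class names `Reasoner` and `Reconsiderator`, the GRU-based modules, the dimensions, and the mean-pooled visual context) is an illustrative assumption, not the architecture described in the paper.

```python
# Minimal sketch (not the authors' implementation) of alternating
# reasoning and reconsidering for video captioning. All module names,
# dimensions, and the two-pass decode scheme are illustrative assumptions.
import torch
import torch.nn as nn


class Reasoner(nn.Module):
    """First pass: stepwise reasoning over video features to draft a caption."""

    def __init__(self, feat_dim, hidden, vocab):
        super().__init__()
        self.rnn = nn.GRUCell(feat_dim, hidden)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, feats, steps):
        h = feats.new_zeros(feats.size(0), self.rnn.hidden_size)
        logits = []
        for _ in range(steps):
            # Simple mean-pooled visual context; the paper's dynamic
            # reasoning would instead decide per step what to attend to.
            h = self.rnn(feats.mean(dim=1), h)
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)


class Reconsiderator(nn.Module):
    """Second pass: rethink the draft caption and summarize context that,
    in turn, can guide the next round of reasoning (coarse to fine)."""

    def __init__(self, hidden, vocab):
        super().__init__()
        self.refine = nn.GRU(vocab, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, draft_logits):
        summary, _ = self.refine(draft_logits.softmax(dim=-1))
        refined = self.out(summary)    # sharpened word distributions
        context = summary.mean(dim=1)  # context to feed back to the reasoner
        return refined, context


# Usage with toy shapes: a batch of 2 videos, 8 frame features of dim 512.
feats = torch.randn(2, 8, 512)
reasoner = Reasoner(512, 256, 1000)
reconsiderator = Reconsiderator(256, 1000)
draft = reasoner(feats, steps=12)
refined, context = reconsiderator(draft)
print(refined.shape, context.shape)  # (2, 12, 1000) and (2, 256)
```

In a full model, `context` would condition the next reasoning round, closing the loop the abstract describes; here it is simply returned to keep the sketch short.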
Cite this paper
Feng, X., He, X., Huang, R., Liu, C. (2021). A Fully Dynamic Context Guided Reasoning and Reconsidering Network for Video Captioning. In: Pham, D.N., Theeramunkong, T., Governatori, G., Liu, F. (eds.) PRICAI 2021: Trends in Artificial Intelligence. Lecture Notes in Computer Science, vol. 13031. Springer, Cham. https://doi.org/10.1007/978-3-030-89188-6_13