Abstract
Video captioning requires a deep understanding of video content in order to describe a video concisely and accurately in a single sentence. Since a video usually contains multiple atomic events, conventional methods that rely on attention mechanisms or frame-word alignment lack deep inductive reasoning over multiple motions and appearances. Inspired by the inductive reasoning mechanism of the human brain, this paper proposes a brain-inspired deeper inductive reasoning (DIR) model. The DIR model applies inductive reasoning to capture the semantic similarities and dissimilarities among multiple atomic events, so that the video can be described concisely and accurately. We evaluate the effectiveness of our method on the public benchmarks MSVD and MSR-VTT. Extensive experiments demonstrate that DIR outperforms state-of-the-art methods and shows clear advantages in deep reasoning over traditional captioning models.
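The abstract's central idea is reasoning over the semantic similarity and dissimilarity of a video's atomic events. As a purely illustrative sketch, and not the authors' DIR implementation (the function name, feature shapes, and the cosine-similarity choice are our assumptions), pairwise event-level similarity could be computed as follows:

import torch
import torch.nn.functional as F

def event_similarity(event_feats: torch.Tensor) -> torch.Tensor:
    """event_feats: (num_events, dim) pooled features, one row per atomic event.
    Returns a (num_events, num_events) cosine-similarity matrix with values in [-1, 1].
    Hypothetical sketch; not the DIR model's actual reasoning module."""
    normed = F.normalize(event_feats, dim=-1)  # scale each event feature to unit length
    return normed @ normed.t()                 # dot products of unit vectors = cosine similarity

# Example: 4 atomic events, each represented by a 512-d feature vector.
feats = torch.randn(4, 512)
sim = event_similarity(feats)
dissim = 1.0 - sim  # a simple dissimilarity counterpart

Under such a setup, highly similar event pairs could be summarized by a shared clause of the caption, while dissimilar events would each contribute distinct content words, which is one plausible reading of describing multiple events "concisely and accurately".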
Data availability
The data that support the findings of this study are available from the corresponding author, F Xu, upon reasonable request.
Acknowledgements
This work was supported by the Fundamental Research Funds for the Central Universities (B220202019), the Changzhou Sci & Tech Program (Grant No. CJ20210092), the Young Talent Development Plan of Changzhou Health Commission (Grant No. CZQM2020025), and the Key Research and Development Program of Jiangsu under Grants BK20192004 and BE2018004-04.
Ethics declarations
Conflict of interest
The authors have no competing interests relevant to the content of this article.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yao, X., Xu, F., Gu, M. et al. Brain-inspired learning to deeper inductive reasoning for video captioning. Int. J. Mach. Learn. & Cyber. 14, 3979–3991 (2023). https://doi.org/10.1007/s13042-023-01876-9