Abstract
Video captioning requires a deep understanding of video content in order to describe a video concisely and accurately in a single sentence. Since a video usually contains multiple atomic events, conventional methods that rely on attention mechanisms or frame-word alignment lack deep inductive reasoning over multiple motions and appearances. Inspired by the inductive reasoning mechanism of the human brain, this paper proposes a brain-inspired deeper inductive reasoning (DIR) model. The DIR model applies inductive reasoning to capture the semantic similarities and dissimilarities among multiple atomic events, so that the video can be described concisely and accurately. We evaluate the effectiveness of our method on the public benchmarks MSVD and MSR-VTT. Extensive experiments demonstrate that DIR outperforms state-of-the-art methods and shows clear advantages in deep reasoning over traditional captioning models.
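The abstract's central idea is reasoning over the semantic similarity and dissimilarity of a video's atomic events. As a purely illustrative sketch, and not the authors' DIR implementation (the function name, feature shapes, and the cosine-similarity choice are our assumptions), pairwise event-level similarity could be computed as follows:

import torch
import torch.nn.functional as F

def event_similarity(event_feats: torch.Tensor) -> torch.Tensor:
    """event_feats: (num_events, dim) pooled features, one row per atomic event.
    Returns a (num_events, num_events) cosine-similarity matrix with values in [-1, 1].
    Hypothetical sketch; not the DIR model's actual reasoning module."""
    normed = F.normalize(event_feats, dim=-1)  # scale each event feature to unit length
    return normed @ normed.t()                 # dot products of unit vectors = cosine similarity

# Example: 4 atomic events, each represented by a 512-d feature vector.
feats = torch.randn(4, 512)
sim = event_similarity(feats)
dissim = 1.0 - sim  # a simple dissimilarity counterpart

Under such a setup, highly similar event pairs could be summarized by a shared clause of the caption, while dissimilar events would each contribute distinct content words, which is one plausible reading of describing multiple events "concisely and accurately".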
Data availability
The data that support the findings of this study are available from the corresponding author, F Xu, upon reasonable request.
Acknowledgements
This work was supported by the Fundamental Research Funds for the Central Universities (B220202019), the Changzhou Sci & Tech Program (Grant No. CJ20210092), the Young Talent Development Plan of Changzhou Health Commission (Grant No. CZQM2020025), and the Key Research and Development Program of Jiangsu under Grants BK20192004 and BE2018004-04.
Ethics declarations
Conflict of interest
The authors have no competing interests relevant to the content of this article.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yao, X., Xu, F., Gu, M. et al. Brain-inspired learning to deeper inductive reasoning for video captioning. Int. J. Mach. Learn. & Cyber. 14, 3979–3991 (2023). https://doi.org/10.1007/s13042-023-01876-9