Brain-inspired learning to deeper inductive reasoning for video captioning

  • Original Article
  • Published in: International Journal of Machine Learning and Cybernetics

Abstract

Video captioning requires deeply understanding video content and describing the video concisely and accurately in one sentence. Since a video usually contains multiple atomic events, conventional methods that rely on attention mechanisms or frame-word alignment lack deep inductive reasoning over the multiple motions and appearances. Inspired by the inductive reasoning mechanism of the human brain, this paper proposes a brain-inspired deeper inductive reasoning model (DIR). The DIR model performs inductive reasoning to capture the semantic similarity and dissimilarity among multiple atomic events, so that the video can be described concisely and accurately. We evaluate the effectiveness of our method on public benchmarks (MSVD and MSR-VTT). Extensive experiments demonstrate that DIR outperforms general state-of-the-art methods and shows advantages in deep reasoning compared with traditional captioning models.
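To make the abstract's notion of reasoning over the semantic similarity and dissimilarity of atomic events more concrete, the sketch below groups toy event embeddings by cosine similarity. This is an illustrative assumption, not the authors' DIR architecture: the feature shapes, the threshold, and names such as `event_features` and `group_events` are hypothetical.

```python
# Minimal, hypothetical sketch (not the authors' DIR implementation):
# given feature vectors for the atomic events detected in a video,
# compute pairwise cosine similarity and split the events into
# "semantically similar" groups, roughly the kind of similarity /
# dissimilarity evidence the abstract says the model reasons over
# before generating a single sentence.
import numpy as np


def cosine_similarity_matrix(features: np.ndarray) -> np.ndarray:
    """features: (num_events, dim) array of per-event embeddings."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    normalized = features / np.clip(norms, 1e-8, None)
    return normalized @ normalized.T


def group_events(features: np.ndarray, threshold: float = 0.5):
    """Greedy grouping: events whose similarity exceeds `threshold`
    are treated as one semantic cluster (illustrative only)."""
    sim = cosine_similarity_matrix(features)
    groups, assigned = [], set()
    for i in range(len(features)):
        if i in assigned:
            continue
        group = [i] + [j for j in range(i + 1, len(features))
                       if j not in assigned and sim[i, j] > threshold]
        assigned.update(group)
        groups.append(group)
    return groups, sim


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy stand-ins for pooled appearance/motion features of 4 atomic events.
    event_features = rng.normal(size=(4, 128))
    # Make event 1 a near-duplicate of event 0 so they fall into one group.
    event_features[1] = event_features[0] + 0.05 * rng.normal(size=128)
    groups, sim = group_events(event_features)
    print("similarity matrix:\n", np.round(sim, 2))
    print("event groups:", groups)
```

In this toy setting, similar events collapse into one group while dissimilar events remain separate, which is one plausible way to summarize several atomic events before producing a single concise caption.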

Data availability

The data that support the findings of this study are available from the corresponding author, F Xu, upon reasonable request.

Acknowledgements

This work was supported by the Fundamental Research Funds for the Central Universities (B220202019), the Changzhou Sci & Tech Program (Grant No. CJ20210092), the Young Talent Development Plan of Changzhou Health Commission (Grant No. CZQM2020025), and the Key Research and Development Program of Jiangsu under Grants BK20192004 and BE2018004-04.

Author information

Corresponding author

Correspondence to Feiyang Xu.

Ethics declarations

Conflict of interest

The authors have no competing interests that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Yao, X., Xu, F., Gu, M. et al. Brain-inspired learning to deeper inductive reasoning for video captioning. Int. J. Mach. Learn. & Cyber. 14, 3979–3991 (2023). https://doi.org/10.1007/s13042-023-01876-9

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s13042-023-01876-9
