
Cross-language multimodal scene semantic guidance and leap sampling for video captioning


Abstract

In recent years, video captioning, which describes video content in natural language, has achieved encouraging results. However, most previous studies decode video encodings directly and rarely explore the role of scene semantics in caption generation, especially cross-language and multimodal scene semantics. The same video can be described in different languages, which differ in form yet are inherently related. Meanwhile, despite high evaluation scores, some generated captions fail to represent the video content because they contain many nonentity words. Motivated by these observations, we propose a cross-language scene semantic guidance captioning model. It first learns the high-level scene semantics of a video in different languages and extracts multilanguage features from them. These features characterize the video content and guide caption generation, steering the captions toward the video content. In addition, we apply a leap sampling method for learning entity words so that the generated captions better represent the video content. Experiments on the public MSR-VTT and VATEX datasets show that our model is effective. Finally, we establish a multilingual student classroom behavior caption dataset for education scenarios, providing a basis for research on captioning tasks in the education area, and we apply our model to this dataset with promising results. The dataset is available to download online: https://github.com/BNU-Wu/Student-Class-Behavior-Dataset/tree/master.
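To make the two ideas above more concrete, the sketch below is a minimal, hypothetical PyTorch example: the module and function names, dimensions, and weighting scheme are our own illustration and are not taken from the authors' released code. It shows one way scene-concept predictions in two languages could be fused into a guidance vector for a caption decoder, and one way entity words could be emphasized during training in the spirit of leap sampling.

```python
# Illustrative sketch only (assumption): NOT the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossLanguageSemanticGuide(nn.Module):
    """Predicts scene-concept probabilities in two languages and fuses them
    into a single guidance vector for the caption decoder."""

    def __init__(self, feat_dim=1024, n_en_concepts=300, n_zh_concepts=300, guide_dim=512):
        super().__init__()
        self.en_head = nn.Linear(feat_dim, n_en_concepts)  # English scene semantics
        self.zh_head = nn.Linear(feat_dim, n_zh_concepts)  # Chinese scene semantics
        self.fuse = nn.Linear(n_en_concepts + n_zh_concepts, guide_dim)

    def forward(self, video_feat):
        # video_feat: (batch, feat_dim), a pooled multimodal video representation
        p_en = torch.sigmoid(self.en_head(video_feat))  # multi-label concept scores
        p_zh = torch.sigmoid(self.zh_head(video_feat))
        guide = torch.tanh(self.fuse(torch.cat([p_en, p_zh], dim=-1)))
        return guide  # (batch, guide_dim)


def entity_weighted_loss(logits, targets, entity_mask, entity_weight=2.0):
    """Token-level cross-entropy that up-weights entity words, a stand-in for
    sampling entity tokens more aggressively while training the decoder.

    logits: (batch, steps, vocab); targets, entity_mask: (batch, steps).
    """
    ce = F.cross_entropy(logits.flatten(0, 1), targets.flatten(), reduction="none")
    weights = 1.0 + (entity_weight - 1.0) * entity_mask.float().flatten()
    return (ce * weights).mean()
```

In such a setup the guidance vector would typically be concatenated to the decoder input at every time step, and entity_mask would flag noun and verb tokens obtained from a part-of-speech tagger; both choices are assumptions made for this sketch.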



Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant Nos. 62077009 and 62177006) and the Zhuhai Science and Technology Planning Project (Grant No. ZH22036201210161PWC).

Author information

Corresponding author

Correspondence to Jun He.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest with regard to this work and no commercial or associative interest that represents a conflict of interest in connection with the submitted work.

Data availability statement

The data included in this study are available from the corresponding author upon reasonable request.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Cite this article

Sun, B., Wu, Y., Zhao, Y. et al. Cross-language multimodal scene semantic guidance and leap sampling for video captioning. Vis Comput 39, 9–25 (2023). https://doi.org/10.1007/s00371-021-02309-w

