
Hierarchical Multimodal Attention Network Based on Semantically Textual Guidance for Video Captioning

  • Conference paper
Neural Information Processing (ICONIP 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13625)


Abstract

When humans watch a video and describe its contents in natural language, they naturally draw on multiple modalities to understand its semantics. In this paper, we propose a hierarchical multimodal attention network for video captioning that promotes visual-textual and visual-visual information interactions. It is composed of two types of attention modules that learn multimodal visual representations in a hierarchical manner. Specifically, the visual-textual attention modules align semantic textual guidance with global and local visual representations, leading to a comprehensive understanding of the video-language correspondence. The visual-visual attention modules then jointly model the diverse visual representations, producing compact and powerful video representations for the caption model. Extensive experiments on two public benchmark datasets demonstrate that our approach is highly competitive with state-of-the-art methods.
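The abstract only sketches the two-level design. The following minimal PyTorch sketch illustrates one plausible reading of it: each visual stream (global frame features, local region features) is first refined by attending to semantic textual guidance, and the refined streams are then fused by a visual-visual attention step. All module names, tensor dimensions, and the use of nn.MultiheadAttention are our assumptions for illustration, not the authors' implementation.

```python
# A minimal sketch of a hierarchical multimodal attention design, assuming
# standard multi-head attention blocks. Shapes and names are hypothetical.
import torch
import torch.nn as nn


class VisualTextualAttention(nn.Module):
    """Level 1: align one visual stream with semantic textual guidance."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, visual: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # Visual tokens query the textual guidance, so each visual token is
        # refined by the semantics most related to it.
        out, _ = self.attn(query=visual, key=text, value=text)
        return out + visual  # residual keeps the original visual signal


class HierarchicalMultimodalAttention(nn.Module):
    """Level 1: visual-textual alignment per stream.
    Level 2: visual-visual fusion across streams."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.global_vt = VisualTextualAttention(dim, heads)
        self.local_vt = VisualTextualAttention(dim, heads)
        self.vv = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, global_feats, local_feats, text_feats):
        # global_feats: (B, T, D) frame-level features
        # local_feats:  (B, N, D) region-level features
        # text_feats:   (B, L, D) semantic textual guidance
        g = self.global_vt(global_feats, text_feats)
        loc = self.local_vt(local_feats, text_feats)
        # Visual-visual attention: the global stream attends to the local
        # stream, yielding a compact joint representation for the captioner.
        fused, _ = self.vv(query=g, key=loc, value=loc)
        return fused + g


if __name__ == "__main__":
    B, T, N, L, D = 2, 26, 36, 5, 512
    model = HierarchicalMultimodalAttention(D)
    video = model(torch.randn(B, T, D), torch.randn(B, N, D), torch.randn(B, L, D))
    print(video.shape)  # torch.Size([2, 26, 512])
```

The fused output would then feed a caption decoder in place of raw frame features; the hierarchical ordering (textual alignment first, visual fusion second) is the property the abstract emphasizes.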

This work was supported by the Scientific Research Project of Tianjin Educational Committee under Grant 2021KJ037 and the Fundamental Research Funds for the Central Universities under Grant 3122021052.



Author information

Correspondence to Tao Xu.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Liu, C., Ma, X., He, X., Xu, T. (2023). Hierarchical Multimodal Attention Network Based on Semantically Textual Guidance for Video Captioning. In: Tanveer, M., Agarwal, S., Ozawa, S., Ekbal, A., Jatowt, A. (eds) Neural Information Processing. ICONIP 2022. Lecture Notes in Computer Science, vol 13625. Springer, Cham. https://doi.org/10.1007/978-3-031-30111-7_14


  • DOI: https://doi.org/10.1007/978-3-031-30111-7_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-30110-0

  • Online ISBN: 978-3-031-30111-7

  • eBook Packages: Computer Science, Computer Science (R0)
