Benefit from AMR: Image Captioning with Explicit Relations and Endogenous Knowledge

  • Conference paper
  • In: Web and Big Data (APWeb-WAIM 2023)

Abstract

Most recent image captioning methods explore implicit relationships among objects through object-based visual feature modeling, but fail to capture explicit relations or achieve semantic association. To tackle these problems, we present a novel method based on Abstract Meaning Representation (AMR). Specifically, in addition to implicit relationship modeling of visual features, we design an AMR generator that extracts explicit relations from images and models these relations during caption generation. In addition, we construct an AMR-based endogenous knowledge graph, which provides prior knowledge for semantic association and strengthens the semantic expression ability of the captioning model without any external resources. Extensive experiments on the public MS COCO dataset show that the AMR-based explicit semantic features and the associated semantic features further boost image captioning and lead to higher-quality captions.
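To make the endogenous-knowledge idea concrete, the sketch below is a minimal illustration (not the authors' implementation) of how AMR-style (source, relation, target) triples extracted from training captions could be aggregated into a weighted concept graph and then queried for associated concepts; the function names (build_endogenous_kg, associate) and the toy triples are hypothetical.

    from collections import defaultdict

    def build_endogenous_kg(caption_amr_triples):
        """Aggregate AMR-style (source, relation, target) triples from all
        training captions into a weighted concept graph, where an edge weight
        counts how often two concepts are linked by that relation."""
        kg = defaultdict(lambda: defaultdict(int))
        for triples in caption_amr_triples:
            for src, rel, tgt in triples:
                kg[src][(rel, tgt)] += 1
        return kg

    def associate(kg, concept, top_k=3):
        """Retrieve the top-k (relation, concept) pairs most frequently
        attached to `concept`, i.e. prior knowledge for semantic association."""
        neighbors = kg.get(concept, {})
        return sorted(neighbors.items(), key=lambda kv: -kv[1])[:top_k]

    if __name__ == "__main__":
        # Hypothetical triples that an AMR parser might produce for two
        # captions such as "a man rides a horse" and "a man rides a bike".
        captions = [
            [("ride-01", ":ARG0", "man"), ("ride-01", ":ARG1", "horse")],
            [("ride-01", ":ARG0", "man"), ("ride-01", ":ARG1", "bike")],
        ]
        kg = build_endogenous_kg(captions)
        print(associate(kg, "ride-01"))
        # -> [((':ARG0', 'man'), 2), ((':ARG1', 'horse'), 1), ((':ARG1', 'bike'), 1)]

In the paper itself, the associated concepts are folded back into the caption decoder as additional semantic features; the sketch only shows the aggregation and lookup step.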


Notes

  1. https://cocodataset.org/.

  2. https://github.com/tylin/coco-caption.


Author information

Corresponding author

Correspondence to Ting Wang.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Chen, F., Li, X., Tang, J., Li, S., Wang, T. (2024). Benefit from AMR: Image Captioning with Explicit Relations and Endogenous Knowledge. In: Song, X., Feng, R., Chen, Y., Li, J., Min, G. (eds) Web and Big Data. APWeb-WAIM 2023. Lecture Notes in Computer Science, vol 14332. Springer, Singapore. https://doi.org/10.1007/978-981-97-2390-4_25

  • DOI: https://doi.org/10.1007/978-981-97-2390-4_25

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-2389-8

  • Online ISBN: 978-981-97-2390-4

  • eBook Packages: Computer Science; Computer Science (R0)
