Abstract
Most recent image captioning methods explore implicit relationships among objects through object-based visual feature modeling, but fail to capture explicit relations or achieve semantic association. To tackle these problems, we present a novel method based on Abstract Meaning Representation (AMR). Specifically, in addition to implicit relationship modeling of visual features, we design an AMR generator that extracts explicit relations from images and models these relations during caption generation. Furthermore, we construct an AMR-based endogenous knowledge graph, which supplies prior knowledge for semantic association and strengthens the semantic expression ability of the captioning model without any external resources. Extensive experiments on the public MS COCO dataset show that the AMR-based explicit semantic features and the associated semantic features further boost image captioning, producing higher-quality captions.
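To make the idea of AMR-based explicit relations concrete, the following is a minimal illustrative sketch (not the authors' implementation; all names are hypothetical): an AMR graph for the caption "a dog sits on a bench" is stored as (source, role, target) triples, and explicit relation tuples are read off as semantic features.

```python
# Hypothetical sketch: an AMR graph for "a dog sits on a bench" as
# (source, role, target) triples. ":instance" triples bind variable ids
# to concepts; the remaining triples are the explicit relations.
AMR_TRIPLES = [
    ("s", ":instance", "sit-01"),   # root event concept
    ("d", ":instance", "dog"),
    ("b", ":instance", "bench"),
    ("s", ":ARG1", "d"),            # the dog is the thing sitting
    ("s", ":location", "b"),        # the sitting happens on the bench
]

def explicit_relations(triples):
    """Resolve variable ids to concepts and return (head, role, tail) tuples."""
    concepts = {src: tgt for src, role, tgt in triples if role == ":instance"}
    return [
        (concepts[src], role, concepts[tgt])
        for src, role, tgt in triples
        if role != ":instance" and tgt in concepts
    ]

print(explicit_relations(AMR_TRIPLES))
# → [('sit-01', ':ARG1', 'dog'), ('sit-01', ':location', 'bench')]
```

Such relation tuples, unlike purely implicit visual features, name both the objects and the semantic role connecting them, which is what the AMR generator is designed to supply to the captioning model.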
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Chen, F., Li, X., Tang, J., Li, S., Wang, T. (2024). Benefit from AMR: Image Captioning with Explicit Relations and Endogenous Knowledge. In: Song, X., Feng, R., Chen, Y., Li, J., Min, G. (eds) Web and Big Data. APWeb-WAIM 2023. Lecture Notes in Computer Science, vol 14332. Springer, Singapore. https://doi.org/10.1007/978-981-97-2390-4_25
DOI: https://doi.org/10.1007/978-981-97-2390-4_25
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-2389-8
Online ISBN: 978-981-97-2390-4
eBook Packages: Computer Science, Computer Science (R0)