
Learning cross-modality features for image caption generation

Original Article · International Journal of Machine Learning and Cybernetics

Abstract

Image captioning is a challenging task at the intersection of vision and language research. In a typical deep learning-based captioning model, two types of input features are used to generate the token at the current inference step: the attended visual feature and the embedding of the previous word. Sentence-level embeddings, however, are ignored in this standard pipeline. In this paper, we propose Intrinsic Cross-Modality Captioning (ICMC), a new method that improves image captioning with sentence-level embeddings and cross-modality alignment. The novelty of the proposed model lies mainly in the text encoder and the cross-modality module. In the feature encoding stage, an adaptation module maps the global visual features into the joint embedding domain; in the decoding stage, the adapted features guide the visual attention process over the RCNN region features. With the proposed method, caption generation not only attends to the visual features and the previous word but also incorporates sentence-level cues from the ground-truth captions during training. Evaluation on the MSCOCO benchmark and extensive ablation studies validate the effectiveness of the proposed method.
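To make the pipeline described above concrete, the following is a minimal PyTorch sketch of the two pieces the abstract highlights: an adaptation module that maps the global visual feature into the joint domain, and an attention step over RCNN region features guided by that adapted feature. All class names, dimensions, and design details here are illustrative assumptions rather than the authors' released implementation; the sentence-level embedding and the alignment loss used at training time are only indicated in comments.

```python
# Hedged sketch of the encoding/attention-guidance idea (assumed details, not the ICMC code).
import torch
import torch.nn as nn

class VisualAdapter(nn.Module):
    """Maps a global visual feature into the joint vision-language space."""
    def __init__(self, vis_dim=2048, joint_dim=512):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(vis_dim, joint_dim), nn.ReLU())

    def forward(self, global_feat):            # (B, vis_dim)
        return self.proj(global_feat)          # (B, joint_dim)

class GuidedRegionAttention(nn.Module):
    """Attends over RCNN region features, with the query built from the
    adapted global feature and the decoder hidden state."""
    def __init__(self, region_dim=2048, joint_dim=512, hidden_dim=512):
        super().__init__()
        self.region_proj = nn.Linear(region_dim, joint_dim)
        self.query_proj = nn.Linear(joint_dim + hidden_dim, joint_dim)

    def forward(self, regions, adapted_global, hidden):
        # regions: (B, R, region_dim); adapted_global: (B, joint_dim); hidden: (B, hidden_dim)
        keys = self.region_proj(regions)                                       # (B, R, joint_dim)
        query = self.query_proj(torch.cat([adapted_global, hidden], dim=-1))   # (B, joint_dim)
        weights = torch.softmax((keys @ query.unsqueeze(-1)).squeeze(-1), -1)  # (B, R)
        return (weights.unsqueeze(-1) * keys).sum(dim=1)                       # (B, joint_dim)

# Usage sketch (hypothetical shapes): the attended context would feed the word
# predictor together with the previous word embedding; during training, an
# alignment loss would pull the adapted visual feature toward a sentence-level
# embedding of the ground-truth caption.
adapter = VisualAdapter()
attend = GuidedRegionAttention()
g = torch.randn(4, 2048)        # global CNN feature per image
r = torch.randn(4, 36, 2048)    # 36 RCNN region features per image
h = torch.randn(4, 512)         # decoder hidden state at the current step
ctx = attend(r, adapter(g), h)  # (4, 512) attended visual context
```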


Acknowledgements

This work is supported by the Key Project of Science and Technology Innovation 2030 of the Ministry of Science and Technology of China (Grant No. 2018AAA0101301), and in part by the Hong Kong GRF-RGC General Research Fund under Grant 11209819 (CityU 9042816) and Grant 11203820 (9042598).

Author information

Corresponding author

Correspondence to Sam Kwong.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Zeng, C., Kwong, S. Learning cross-modality features for image caption generation. Int. J. Mach. Learn. & Cyber. 13, 2059–2070 (2022). https://doi.org/10.1007/s13042-022-01506-w
