Abstract
Image captioning is a challenging task at the intersection of vision and language research. In a typical deep learning-based captioning model, two types of input features are used to generate the token at the current decoding step: the attended visual feature and the embedding of the previous word. Sentence-level embeddings, however, are ignored in this standard pipeline. In this paper, we propose Intrinsic Cross-Modality Captioning (ICMC), a new method that improves image captioning with sentence-level embeddings and cross-modality alignment. The novelty of the proposed model lies mainly in the text encoder and the cross-modality module. In the feature encoding stage, an adaptation module maps the global visual features into the joint vision-language domain. In the decoding stage, the adapted features then guide the visual attention process over the RCNN region features. The proposed method therefore attends not only to the visual features and the previous word, but also incorporates sentence-level cues from the ground-truth captions during training. Evaluation on the MSCOCO benchmark and extensive ablation studies validate the effectiveness of the proposed method.
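The architectural details are not reproduced on this page, but the abstract's two key components lend themselves to a short illustration. Below is a minimal PyTorch sketch of (a) an adaptation module that maps a global visual feature into the joint vision-language space and (b) a decoding-time attention step in which the adapted feature guides attention over RCNN region features. All class names, layer sizes, and the additive-attention form are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptationModule(nn.Module):
    """Maps a global CNN feature into the joint vision-language space.
    (Illustrative sketch; dimensions are assumptions, not the paper's.)"""
    def __init__(self, vis_dim=2048, joint_dim=512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vis_dim, joint_dim),
            nn.ReLU(),
            nn.Linear(joint_dim, joint_dim),
        )

    def forward(self, global_feat):  # (B, vis_dim)
        # L2-normalize so visual and sentence embeddings are comparable
        return F.normalize(self.proj(global_feat), dim=-1)  # (B, joint_dim)

class GuidedAttention(nn.Module):
    """Additive attention over region features, conditioned on both the
    decoder hidden state and the adapted global feature (hypothetical form)."""
    def __init__(self, region_dim=2048, hid_dim=512, joint_dim=512):
        super().__init__()
        self.w_r = nn.Linear(region_dim, hid_dim)
        self.w_h = nn.Linear(hid_dim, hid_dim)
        self.w_g = nn.Linear(joint_dim, hid_dim)
        self.score = nn.Linear(hid_dim, 1)

    def forward(self, regions, h, guide):
        # regions: (B, N, region_dim), h: (B, hid_dim), guide: (B, joint_dim)
        e = self.score(torch.tanh(
            self.w_r(regions) + (self.w_h(h) + self.w_g(guide)).unsqueeze(1)
        ))                                   # (B, N, 1) attention logits
        alpha = F.softmax(e, dim=1)          # weights over the N regions
        return (alpha * regions).sum(dim=1)  # (B, region_dim) attended feature

# Usage sketch: B=2 images, N=36 RCNN regions
adapt, attend = AdaptationModule(), GuidedAttention()
g = adapt(torch.randn(2, 2048))
ctx = attend(torch.randn(2, 36, 2048), torch.randn(2, 512), g)  # (2, 2048)
```

One plausible training signal, consistent with the abstract, is to align a sentence-level embedding of the ground-truth caption with the adapted global feature in the joint space, e.g. via a cosine or triplet objective; the precise alignment loss belongs to the full paper.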
Acknowledgements
This work was supported in part by the Key Project of Science and Technology Innovation 2030, funded by the Ministry of Science and Technology of China (Grant No. 2018AAA0101301), and in part by the Hong Kong GRF-RGC General Research Fund under Grant 11209819 (CityU 9042816) and Grant 11203820 (9042598).
Cite this article
Zeng, C., Kwong, S. Learning cross-modality features for image caption generation. Int. J. Mach. Learn. & Cyber. 13, 2059–2070 (2022). https://doi.org/10.1007/s13042-022-01506-w