Abstract
Depicting an image in a specific style (e.g., positive, negative, humorous, or romantic) is drawing growing attention. Because stylistic datasets lack diversity, a larger factual corpus is typically introduced to strengthen the correlation between the generated caption and the image content. However, owing to the emphasis on emotional expression, the model may neglect semantic representation, which reduces the consistency of the stylized caption with the objects and content of the image. To address this issue, we propose CA-GAN, an image captioning system built on an adversarial training mechanism. Conditioned on image features and semantic vectors, a refining gate extracts the most informative context from sentences, and an architecture with two separate LSTMs learns semantic knowledge at a comprehensive level. During adversarial training, the parameters of the generator and the discriminator are updated interactively on the stylistic corpus. Benefiting from these components, the generated caption integrates sentiment-bearing properties with appropriate factual information and correlates strongly with the image. Evaluation results show the strong performance of our approach, and linguistic analysis demonstrates that our model improves both the consistency of the stylized caption with the objects and content of the image and its attractiveness.
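The abstract describes the adversarial setup only at a high level; the sketch below illustrates, in PyTorch, what alternating generator and discriminator updates conditioned on image features can look like. It is a minimal approximation made for exposition: the module names, dimensions, single-LSTM generator, LSTM-based discriminator, and SeqGAN-style policy-gradient surrogate for the generator update are assumptions, not the paper's actual CA-GAN refining gate or two-LSTM architecture.

# Minimal PyTorch sketch of an adversarial caption-training loop of the kind
# outlined in the abstract. All names, dimensions, and loss choices here are
# illustrative assumptions, not the authors' exact CA-GAN implementation.
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Caption generator: an LSTM decoder conditioned on an image feature vector."""
    def __init__(self, vocab_size, img_dim=2048, emb_dim=256, hid_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hid_dim)   # image feature -> initial hidden state
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def init_state(self, img_feat):
        h0 = torch.tanh(self.img_proj(img_feat)).unsqueeze(0)   # (1, B, H)
        return h0, torch.zeros_like(h0)

    def sample(self, img_feat, bos_id, max_len=16):
        """Free-running sampling; returns token ids and per-step log-probabilities."""
        state = self.init_state(img_feat)
        tokens = torch.full((img_feat.size(0), 1), bos_id, dtype=torch.long)
        log_probs = []
        for _ in range(max_len):
            emb = self.embed(tokens[:, -1:])                     # (B, 1, E)
            out, state = self.lstm(emb, state)
            dist = torch.distributions.Categorical(logits=self.out(out[:, -1]))
            nxt = dist.sample()                                  # (B,)
            log_probs.append(dist.log_prob(nxt))
            tokens = torch.cat([tokens, nxt.unsqueeze(1)], dim=1)
        return tokens[:, 1:], torch.stack(log_probs, dim=1)      # drop the BOS column

class Discriminator(nn.Module):
    """Scores how much a (caption, image) pair looks like a real stylized caption."""
    def __init__(self, vocab_size, img_dim=2048, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.score = nn.Linear(hid_dim + img_dim, 1)

    def forward(self, img_feat, captions):
        _, (h, _) = self.lstm(self.embed(captions))
        return self.score(torch.cat([h[-1], img_feat], dim=1))   # (B, 1) real/fake logit

def adversarial_step(gen, dis, gen_opt, dis_opt, img_feat, real_caps, bos_id):
    """One interactive update: discriminator on real vs. sampled captions,
    then a REINFORCE-style generator update rewarded by the discriminator."""
    bce = nn.BCEWithLogitsLoss()
    # --- discriminator update ---
    with torch.no_grad():
        fake_caps, _ = gen.sample(img_feat, bos_id)
    ones = torch.ones(real_caps.size(0), 1)
    d_loss = bce(dis(img_feat, real_caps), ones) + bce(dis(img_feat, fake_caps), 1 - ones)
    dis_opt.zero_grad(); d_loss.backward(); dis_opt.step()
    # --- generator update (policy gradient, reward = discriminator probability) ---
    fake_caps, log_probs = gen.sample(img_feat, bos_id)
    with torch.no_grad():
        reward = torch.sigmoid(dis(img_feat, fake_caps))          # (B, 1)
    g_loss = -(log_probs * reward).mean()
    gen_opt.zero_grad(); g_loss.backward(); gen_opt.step()
    return d_loss.item(), g_loss.item()

Sampling, rather than taking an argmax, matters in the generator update: caption tokens are discrete, so the discriminator's score cannot be backpropagated directly and instead enters as a reward, which is the usual SeqGAN-style workaround.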









Availability of data and material
The raw/processed data required to reproduce these findings cannot be shared at this time as the data also form part of an ongoing study.
Code availability
The code required to reproduce these findings cannot be shared at this time, as the code also forms part of an ongoing study.
Funding
The authors did not receive support from any organization for the submitted work.
Author information
Authors and Affiliations
Contributions
All authors contributed to the study conception and design. Material preparation, data collection, and analysis were performed by Junlong Feng, who also wrote the first draft of the manuscript; all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Feng, J., Zhao, J. Improving stylized caption compatibility with image content by integrating region context. Neural Comput & Applic 34, 4151–4163 (2022). https://doi.org/10.1007/s00521-021-06422-8