research-article

Diverse Image Captioning via Conditional Variational Autoencoder and Dual Contrastive Learning

Published: 18 September 2023

Abstract

Diverse image captioning has achieved substantial progress in recent years. However, traditional diverse image captioning models generally overlook the discriminability of generative models and the limitations of the cross-entropy loss, which seriously hurts both the diversity and the accuracy of the generated captions. In this article, aiming to improve diversity and accuracy simultaneously, we propose a novel Dual Contrastive Learning Conditional Variational Autoencoder (DCL-CVAE) framework for diverse image captioning that seamlessly integrates a sequential variational autoencoder with contrastive learning. In the encoding stage, we first build conditional variational autoencoders to separately learn the sequential latent spaces for a pair of captions. We then introduce contrastive learning in these sequential latent spaces to enhance the discriminability of the latent representations for both matched image-caption pairs and mismatched pairs. In the decoding stage, we use captions sampled from a pre-trained Long Short-Term Memory (LSTM) decoder as negative examples and perform contrastive learning against the greedily sampled positive examples, which restrains the generation of the common words and phrases induced by the cross-entropy loss. By virtue of this dual contrastive learning, DCL-CVAE encourages discriminability and facilitates diversity while promoting the accuracy of the generated captions. Extensive experiments on the challenging MSCOCO dataset show that our method achieves a better balance between accuracy and diversity than state-of-the-art diverse image captioning models.
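The abstract does not spell out the exact form of the contrastive objectives. As a rough, hypothetical sketch only (not the authors' implementation), the snippet below shows an InfoNCE-style contrastive loss over latent representations, where matched image-caption pairs serve as positives and mismatched pairs within a batch serve as negatives; the function name, temperature value, and toy dimensions are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def latent_contrastive_loss(image_latents, caption_latents, temperature=0.07):
    """InfoNCE-style loss: matched image-caption latents (same row index)
    are pulled together; mismatched pairs in the batch act as negatives.
    This is a generic illustration, not the paper's exact objective."""
    # Normalize so the dot product becomes a cosine similarity.
    img = F.normalize(image_latents, dim=-1)
    cap = F.normalize(caption_latents, dim=-1)
    # Pairwise similarity matrix: entry (i, j) compares image i with caption j.
    logits = img @ cap.t() / temperature
    # The positive for image i is caption i (the diagonal of the matrix).
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric loss over both matching directions (image-to-caption and back).
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy usage with random latents (batch of 8, latent dimension 128).
if __name__ == "__main__":
    z_img = torch.randn(8, 128)
    z_cap = torch.randn(8, 128)
    print(latent_contrastive_loss(z_img, z_cap).item())
```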



• Published in
  ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 1 (January 2024), 639 pages
  ISSN: 1551-6857
  EISSN: 1551-6865
  DOI: 10.1145/3613542
  Editor: Abdulmotaleb El Saddik


        Publisher

        Association for Computing Machinery, New York, NY, United States

        Publication History

        • Published: 18 September 2023
        • Online AM: 11 August 2023
        • Accepted: 28 July 2023
        • Revised: 1 July 2023
        • Received: 16 December 2022
        Published in ACM TOMM Volume 20, Issue 1
