Abstract
Diverse image captioning has achieved substantial progress in recent years. However, traditional diverse image captioning models generally overlook the discriminability of generative models and the limitations of the cross-entropy loss, which seriously hurt both the diversity and the accuracy of image captioning. In this article, aiming to improve diversity and accuracy simultaneously, we propose a novel Conditional Variational Autoencoder framework with Dual Contrastive Learning (DCL-CVAE) for diverse image captioning, which seamlessly integrates a sequential variational autoencoder with contrastive learning. In the encoding stage, we first build conditional variational autoencoders to separately learn the sequential latent spaces for a pair of captions. We then introduce contrastive learning in these sequential latent spaces to enhance the discriminability of the latent representations for both matched image-caption pairs and mismatched pairs. In the decoding stage, we take the captions sampled from a pre-trained Long Short-Term Memory (LSTM) decoder as negative examples and perform contrastive learning against the greedily sampled positive examples, which restrains the generation of the common words and phrases induced by the cross-entropy loss. By virtue of this dual contrastive learning, DCL-CVAE encourages discriminability and facilitates diversity while promoting the accuracy of the generated captions. Extensive experiments on the challenging MSCOCO dataset show that our proposed method achieves a better balance between accuracy and diversity than state-of-the-art diverse image captioning models.
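The contrastive objectives described above — pulling matched image-caption latent representations together while pushing mismatched pairs apart — are commonly instantiated with an InfoNCE-style loss. The sketch below is a minimal, generic NumPy illustration of that idea, not the paper's actual objective: the function name `info_nce_loss`, the temperature value, and the use of in-batch negatives (every non-matching row serves as a mismatched pair) are all our own assumptions.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """Generic InfoNCE-style contrastive loss.

    anchors, positives: (N, D) arrays of latent representations.
    Row i of `positives` is the matched example for row i of
    `anchors`; all other rows act as mismatched (negative) pairs.
    """
    # L2-normalize so dot products become cosine similarities
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                 # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    # log-softmax over each row; the matched pair sits on the diagonal
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

Under this formulation, the loss is small when each anchor is most similar to its own matched positive and large when a mismatched example is more similar, which is the behavior the dual contrastive terms are meant to enforce in both the encoding and decoding stages.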
Index Terms
- Diverse Image Captioning via Conditional Variational Autoencoder and Dual Contrastive Learning