research-article

Diverse Image Captioning via Conditional Variational Autoencoder and Dual Contrastive Learning

Published: 18 September 2023

Abstract

Diverse image captioning has achieved substantial progress in recent years. However, traditional diverse image captioning models generally overlook the discriminability of generative models and the limitations of the cross-entropy loss, which seriously hurts both the diversity and the accuracy of the generated captions. In this article, aiming to improve diversity and accuracy simultaneously, we propose a novel Dual Contrastive Learning Conditional Variational Autoencoder (DCL-CVAE) framework for diverse image captioning that seamlessly integrates a sequential variational autoencoder with contrastive learning. In the encoding stage, we first build conditional variational autoencoders to separately learn the sequential latent spaces for a pair of captions. We then introduce contrastive learning in these sequential latent spaces to enhance the discriminability of the latent representations for both matched image-caption pairs and mismatched pairs. In the decoding stage, we use captions sampled from a pre-trained Long Short-Term Memory (LSTM) decoder as negative examples and perform contrastive learning against the greedily sampled positive examples, which restrains the generation of the common words and phrases induced by the cross-entropy loss. By virtue of this dual contrastive learning, DCL-CVAE encourages discriminability and facilitates diversity while promoting the accuracy of the generated captions. Extensive experiments on the challenging MSCOCO dataset show that our method achieves a better balance between accuracy and diversity than state-of-the-art diverse image captioning models.
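The abstract does not spell out the exact form of the contrastive objectives. As a rough, hypothetical sketch only (not the authors' implementation), the snippet below shows an InfoNCE-style contrastive loss over latent representations, where matched image-caption pairs serve as positives and mismatched pairs within a batch serve as negatives; the function name, temperature value, and toy dimensions are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def latent_contrastive_loss(image_latents, caption_latents, temperature=0.07):
    """InfoNCE-style loss: matched image-caption latents (same row index)
    are pulled together; mismatched pairs in the batch act as negatives.
    This is a generic illustration, not the paper's exact objective."""
    # Normalize so the dot product becomes a cosine similarity.
    img = F.normalize(image_latents, dim=-1)
    cap = F.normalize(caption_latents, dim=-1)
    # Pairwise similarity matrix: entry (i, j) compares image i with caption j.
    logits = img @ cap.t() / temperature
    # The positive for image i is caption i (the diagonal of the matrix).
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric loss over both matching directions (image-to-caption and back).
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy usage with random latents (batch of 8, latent dimension 128).
if __name__ == "__main__":
    z_img = torch.randn(8, 128)
    z_cap = torch.randn(8, 128)
    print(latent_contrastive_loss(z_img, z_cap).item())
```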



• Published in
  ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 1 (January 2024), 639 pages
  ISSN: 1551-6857
  EISSN: 1551-6865
  DOI: 10.1145/3613542
  Editor: Abdulmotaleb El Saddik


        Publisher

        Association for Computing Machinery, New York, NY, United States

        Publication History

        • Published: 18 September 2023
        • Online AM: 11 August 2023
        • Accepted: 28 July 2023
        • Revised: 1 July 2023
        • Received: 16 December 2022
        Published in ACM TOMM Volume 20, Issue 1
