Unsupervised Style Control for Image Captioning

  • Conference paper
  • In: Data Science (ICPCSEE 2022)

Abstract

We propose a novel unsupervised image captioning method. Image captioning spans two fields of deep learning: natural language processing and computer vision. An excessive focus on evaluation metrics has made the captions generated by existing models stylistically monotonous, falling short of people's demand for vivid, stylized image captions. We therefore propose an image captioning model that combines text style transfer with image emotion recognition, enabling the model to better understand images and generate controllable stylized captions. The image emotion recognition module automatically infers the emotion an image conveys, helping the model understand the image content, while the text style transfer method controls the style of the generated description, producing captions that better match people's expectations. To our knowledge, this is the first work to use both image emotion recognition and text style control.
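
The abstract describes a three-stage pipeline: an image emotion recognition module infers the emotion an image conveys, a captioner produces a factual description, and a text style transfer module rewrites that description in the inferred style. The sketch below illustrates only this control flow; the module internals, the emotion label set, and all names (EmotionRecognizer, FactualCaptioner, StyleTransfer, EMOTIONS) are hypothetical placeholders of ours, not the authors' implementation.

```python
# Minimal sketch of the pipeline described in the abstract.
# All components are placeholders, not the paper's models.
import torch
import torch.nn as nn

EMOTIONS = ["positive", "negative", "neutral"]  # assumed label set


class EmotionRecognizer(nn.Module):
    """Maps image features to an emotion distribution (placeholder linear head)."""

    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.head = nn.Linear(feat_dim, len(EMOTIONS))

    def forward(self, img_feats: torch.Tensor) -> torch.Tensor:
        return self.head(img_feats).softmax(dim=-1)


class FactualCaptioner(nn.Module):
    """Stand-in for any pretrained factual captioning model."""

    def forward(self, img_feats: torch.Tensor) -> str:
        return "a dog runs on the grass"  # placeholder caption


class StyleTransfer(nn.Module):
    """Stand-in for an unsupervised text style transfer model."""

    def forward(self, caption: str, style: str) -> str:
        # A real model would rewrite the sentence in the target style;
        # here we only tag it to show where the rewrite happens.
        return f"[{style}] {caption}"


def stylized_caption(img_feats, recognizer, captioner, styler):
    # 1. Infer the emotion conveyed by the image; use it as the target style.
    style = EMOTIONS[recognizer(img_feats).argmax(dim=-1).item()]
    # 2. Generate a factual caption, then 3. rewrite it in the chosen style.
    return styler(captioner(img_feats), style)


if __name__ == "__main__":
    feats = torch.randn(1, 512)  # stand-in for extracted image features
    print(stylized_caption(feats, EmotionRecognizer(), FactualCaptioner(), StyleTransfer()))
```

Decoupling emotion recognition from style rewriting is what makes the style controllable in this reading of the abstract: the predicted label can be overridden with any style the transfer module supports.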

Acknowledgment

This work was supported by the National Key Research and Development Program (Grant No. 2018YFC0831700) and the National Natural Science Foundation of China (Grant Nos. 61671064 and 61732005).

Author information

Corresponding author

Correspondence to Shumin Shi.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Tian, J., Yang, Z., Shi, S. (2022). Unsupervised Style Control for Image Captioning. In: Wang, Y., Zhu, G., Han, Q., Wang, H., Song, X., Lu, Z. (eds) Data Science. ICPCSEE 2022. Communications in Computer and Information Science, vol 1628. Springer, Singapore. https://doi.org/10.1007/978-981-19-5194-7_31

  • DOI: https://doi.org/10.1007/978-981-19-5194-7_31

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-19-5193-0

  • Online ISBN: 978-981-19-5194-7

  • eBook Packages: Computer Science, Computer Science (R0)
