Abstract
We propose a novel unsupervised image captioning method. Image captioning spans two fields of deep learning: natural language processing and computer vision. Because existing models are optimized chiefly for evaluation metrics, the captions they generate tend to be monotonous in style and fail to meet the demand for vivid, stylized descriptions. We therefore propose an image captioning model that combines text style transfer with image emotion recognition, enabling the model to better understand images and generate controllable, stylized captions. The image emotion recognition module automatically infers the emotion conveyed by an image, improving the model's understanding of its content, while the text style transfer module controls the style of the generated description, yielding captions that better match readers' expectations. To our knowledge, this is the first work to combine image emotion recognition with text style control.
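To make the described pipeline concrete, below is a minimal PyTorch sketch of how an image emotion recognition module could condition a caption decoder on a predicted style. All class names, the emotion label set, and the style-embedding mechanism are illustrative assumptions for this sketch, not the authors' implementation; the paper only states that the emotion module judges the image's emotion and that a text style transfer method controls the generated description.

```python
# A minimal sketch, assuming a frozen image encoder (e.g. CLIP) supplies a
# global feature vector; the module names below are hypothetical.
import torch
import torch.nn as nn

EMOTIONS = ["positive", "negative", "neutral"]  # assumed label set


class EmotionRecognizer(nn.Module):
    """Predicts an emotion label from a global image feature."""

    def __init__(self, feat_dim: int = 512, num_emotions: int = len(EMOTIONS)):
        super().__init__()
        self.head = nn.Linear(feat_dim, num_emotions)

    def forward(self, image_feat: torch.Tensor) -> torch.Tensor:
        return self.head(image_feat)  # logits over emotion classes


class StyledCaptioner(nn.Module):
    """Conditions a caption decoder on the predicted emotion by
    prepending a learned style embedding to the image prefix."""

    def __init__(self, feat_dim: int = 512, num_emotions: int = len(EMOTIONS)):
        super().__init__()
        self.emotion = EmotionRecognizer(feat_dim, num_emotions)
        self.style_embed = nn.Embedding(num_emotions, feat_dim)
        # Stand-in for a pretrained language-model decoder such as GPT-2.
        self.decoder = nn.TransformerDecoderLayer(
            d_model=feat_dim, nhead=8, batch_first=True
        )

    def forward(self, image_feat: torch.Tensor, token_embeds: torch.Tensor):
        style_id = self.emotion(image_feat).argmax(dim=-1)        # auto-judged emotion
        style = self.style_embed(style_id).unsqueeze(1)           # (B, 1, D)
        memory = torch.cat([image_feat.unsqueeze(1), style], 1)   # image + style prefix
        return self.decoder(token_embeds, memory)                 # next-token states


# Usage with dummy tensors in place of real encoder outputs:
feats = torch.randn(2, 512)        # global image features
tokens = torch.randn(2, 10, 512)   # embedded caption tokens
out = StyledCaptioner()(feats, tokens)
print(out.shape)                   # torch.Size([2, 10, 512])
```

Overriding `style_id` with a user-chosen label instead of the argmax would give the controllable behavior the abstract describes, since the style embedding is the only channel through which emotion enters the decoder.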
Acknowledgment
This work was supported by the National Key Research and Development Program (Grant No. 2018YFC0831700) and the National Natural Science Foundation of China (Grant Nos. 61671064 and 61732005).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Tian, J., Yang, Z., Shi, S. (2022). Unsupervised Style Control for Image Captioning. In: Wang, Y., Zhu, G., Han, Q., Wang, H., Song, X., Lu, Z. (eds) Data Science. ICPCSEE 2022. Communications in Computer and Information Science, vol 1628. Springer, Singapore. https://doi.org/10.1007/978-981-19-5194-7_31
DOI: https://doi.org/10.1007/978-981-19-5194-7_31
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-5193-0
Online ISBN: 978-981-19-5194-7
eBook Packages: Computer Science; Computer Science (R0)