
Scene graph sorting and shuffle polishing based controllable image captioning

  • Original Paper
  • Published in Signal, Image and Video Processing

Abstract

Controllable image captioning, which steers the generation process through user-specified parameters or conditions to produce descriptions that match user expectations, has become an active research topic in recent years. However, the captions generated by current controllable models usually consist of a single simple sentence, and these sentences tend to be templated rather than linguistically rich, resulting in monotonous descriptions. To address these problems, this paper proposes an optimization method for controllable image caption generation based on scene graph sorting-selection (SS) and shuffle-polishing (SP). First, we use a scene graph to capture the complex object relationships in an image and decompose it into multiple substructures, and we introduce a new two-dimensional matching score algorithm to rank the decomposed substructures. Guided by this ranking or by user intent, the selected substructures are decoded into multiple target sentences that describe the image from different perspectives, increasing the diversity and richness of the generated results. Subsequently, in the shuffle-polishing stage, we treat the sentences initially produced by the language decoder as complete contextual information and exploit the linguistic structure and vocabulary knowledge learned by an unsupervised language model to iteratively update the word at each position of a sentence in a randomized order, ultimately generating flexible and personalized descriptions. Extensive experiments on the MSCOCO and Flickr30k Entities datasets confirm the strong performance of our model in controllable image captioning: it generates novel and diverse captions while maintaining descriptive accuracy.
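The shuffle-polishing step described above can be illustrated with a short sketch. The code below is an assumption-laden illustration rather than the paper's implementation: it assumes the unsupervised model is a BERT-style masked language model, revisits one word position at a time in a random order, and greedily keeps the model's top prediction; any visual-relevance weighting of candidate words used by the actual method is omitted. The function name shuffle_polish and the example caption are hypothetical and chosen only for this sketch.

# Minimal sketch of shuffle polishing, assuming a BERT-style masked language
# model (an assumption; the paper's exact update rule is not reproduced here).
import random

import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()


def shuffle_polish(words, num_rounds=2):
    """Revisit every word position in random order and let the masked LM
    re-predict it, using the rest of the sentence as context."""
    words = list(words)
    for _ in range(num_rounds):
        positions = list(range(len(words)))
        random.shuffle(positions)                      # randomized update order
        for pos in positions:
            masked = words.copy()
            masked[pos] = tokenizer.mask_token         # hide the current word
            inputs = tokenizer(" ".join(masked), return_tensors="pt")
            mask_idx = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero()[0].item()
            with torch.no_grad():
                logits = model(**inputs).logits[0, mask_idx]
            # Greedily keep the model's top candidate for this position.
            words[pos] = tokenizer.convert_ids_to_tokens(int(logits.argmax()))
    return " ".join(words)


# Example: polish an initial caption produced by the language decoder.
print(shuffle_polish("a man riding a horse on the beach".split()))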


Data Availability

The datasets used in this study are available on their respective official websites.


Funding

No funding was received for conducting this study.

Author information


Corresponding author

Correspondence to Qian Zhao.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wu, G., Zhao, Q. & Liu, X. Scene graph sorting and shuffle polishing based controllable image captioning. SIViP 19, 309 (2025). https://doi.org/10.1007/s11760-025-03921-2

