Controllable image caption with an encoder-decoder optimization structure

Abstract

Controllable image captioning, which lies at the intersection of Computer Vision (CV) and Natural Language Processing (NLP), is an important part of applying artificial intelligence to many real-life scenarios. We adopt an encoder-decoder structure that uses visual models as the encoder and language models as the decoder. In this work, we introduce a new feature extraction model, FVC R-CNN, to learn both salient features and visual commonsense features. Furthermore, we propose a novel MT-LSTM neural network for sentence generation, which is activated by m-tanh and outperforms the traditional Long Short-Term Memory (LSTM) network by a significant margin. Finally, we put forward a multi-branch decision strategy to optimize the output. Experiments are conducted on the widely used COCO Entities dataset and demonstrate that the proposed method outperforms the baseline and surpasses state-of-the-art (SOTA) methods under a wide range of evaluation metrics, achieving CIDEr and SPICE scores of 206.3 and 47.6, respectively.
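
To make the encoder-decoder pipeline concrete, the following is a minimal sketch, not the authors' implementation: a captioning decoder in PyTorch that consumes a pooled visual feature vector standing in for FVC R-CNN output. The m_tanh function below is a hypothetical scaled tanh, since the paper's exact m-tanh definition, MT-LSTM cell, and multi-branch decision strategy are not reproduced here.

import torch
import torch.nn as nn

def m_tanh(x, alpha=1.5):
    # Hypothetical scaled tanh standing in for the paper's m-tanh activation.
    return torch.tanh(alpha * x)

class CaptionDecoder(nn.Module):
    def __init__(self, feat_dim, embed_dim, hidden_dim, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.cell = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, visual_feats, tokens):
        # visual_feats: (batch, feat_dim) pooled features from the visual encoder
        # tokens: (batch, seq_len) caption tokens used with teacher forcing
        h = visual_feats.new_zeros(tokens.size(0), self.cell.hidden_size)
        c = torch.zeros_like(h)
        logits = []
        for t in range(tokens.size(1)):
            x = torch.cat([self.embed(tokens[:, t]), visual_feats], dim=-1)
            h, c = self.cell(x, (h, c))
            logits.append(self.out(m_tanh(h)))  # assumed m-tanh applied before the word projection
        return torch.stack(logits, dim=1)       # (batch, seq_len, vocab_size)

# Usage with random tensors standing in for FVC R-CNN features and caption tokens.
decoder = CaptionDecoder(feat_dim=2048, embed_dim=300, hidden_dim=512, vocab_size=10000)
feats = torch.randn(4, 2048)
caps = torch.randint(0, 10000, (4, 12))
print(decoder(feats, caps).shape)  # torch.Size([4, 12, 10000])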

Acknowledgements

This work was supported by the Local College Capacity Building Project of Shanghai Municipal Science and Technology Commission under Grant 20020500700.

Author information

Corresponding author

Correspondence to Runxia Yang.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Shao, J., Yang, R. Controllable image caption with an encoder-decoder optimization structure. Appl Intell 52, 11382–11393 (2022). https://doi.org/10.1007/s10489-021-02988-x
