Abstract
Controllable image captioning, which lies at the intersection of Computer Vision (CV) and Natural Language Processing (NLP), is an important step toward applying artificial intelligence in many real-life scenarios. We adopt an encoder-decoder structure that uses a visual model as the encoder and a language model as the decoder. In this work, we introduce a new feature extraction model, FVC R-CNN, to learn both salient features and visual commonsense features. Furthermore, we propose a novel MT-LSTM neural network for sentence generation, which is activated by m-tanh and outperforms the traditional Long Short-Term Memory (LSTM) network by a significant margin. Finally, we put forward a multi-branch decision strategy to optimize the output. Experiments are conducted on the widely used COCO Entities dataset, and the results demonstrate that the proposed method outperforms the baseline and surpasses previous methods under a wide range of evaluation metrics, achieving state-of-the-art (SOTA) CIDEr and SPICE scores of 206.3 and 47.6, respectively.
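For intuition, the following is a minimal, purely illustrative PyTorch sketch of the encoder-decoder idea summarized above. The exact forms of FVC R-CNN, m-tanh, and the multi-branch decision strategy are not reproduced here: the names m_tanh, MTLSTMCell, and CaptionDecoder below are hypothetical placeholders, assuming a parameterized activation of the form m * tanh(x / m) and a pooled 2048-dimensional image feature; this is a sketch under those assumptions, not the authors' actual implementation.

# Illustrative encoder-decoder captioning skeleton (PyTorch).
# NOTE: the paper does not define m-tanh here; m_tanh below is a
# hypothetical parameterized-tanh placeholder, not the authors' formulation.
import torch
import torch.nn as nn


def m_tanh(x: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
    # Hypothetical scaled tanh: m * tanh(x / m); reduces to tanh when m = 1.
    return m * torch.tanh(x / m)


class MTLSTMCell(nn.Module):
    # Standard LSTM cell with tanh swapped for the m_tanh placeholder.
    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.gates = nn.Linear(input_size + hidden_size, 4 * hidden_size)
        self.m = nn.Parameter(torch.ones(1))  # learnable scale (assumption)

    def forward(self, x, state):
        h, c = state
        i, f, g, o = self.gates(torch.cat([x, h], dim=-1)).chunk(4, dim=-1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * m_tanh(g, self.m)
        h = torch.sigmoid(o) * m_tanh(c, self.m)
        return h, c


class CaptionDecoder(nn.Module):
    # Greedy decoder conditioned on a pooled image feature vector.
    def __init__(self, vocab_size: int, feat_size: int, hidden_size: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.init_h = nn.Linear(feat_size, hidden_size)
        self.init_c = nn.Linear(feat_size, hidden_size)
        self.cell = MTLSTMCell(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, vocab_size)

    @torch.no_grad()
    def greedy_decode(self, feats, bos_id=1, max_len=20):
        # Initialize the LSTM state from the image feature, then emit
        # one token per step by taking the argmax over the vocabulary.
        h, c = self.init_h(feats), self.init_c(feats)
        tok = torch.full((feats.size(0),), bos_id, dtype=torch.long)
        caption = []
        for _ in range(max_len):
            h, c = self.cell(self.embed(tok), (h, c))
            tok = self.out(h).argmax(dim=-1)
            caption.append(tok)
        return torch.stack(caption, dim=1)


if __name__ == "__main__":
    feats = torch.randn(2, 2048)  # stand-in for pooled region features
    decoder = CaptionDecoder(vocab_size=1000, feat_size=2048)
    print(decoder.greedy_decode(feats).shape)  # torch.Size([2, 20])

In the full model described in the abstract, the decoder would additionally consume FVC R-CNN region and visual-commonsense features, and the multi-branch decision strategy would select among candidate outputs; both components are omitted from this sketch.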
Acknowledgements
This work was supported by the Local College Capacity Building Project of Shanghai Municipal Science and Technology Commission under Grant 20020500700.
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Shao, J., Yang, R. Controllable image caption with an encoder-decoder optimization structure. Appl Intell 52, 11382–11393 (2022). https://doi.org/10.1007/s10489-021-02988-x