
Novel model to integrate word embeddings and syntactic trees for automatic caption generation from images

  • Methodologies and Application
  • Published in: Soft Computing

Abstract

Automatic caption generation from images is an active and mainstream direction in machine learning: it enables computer models that can interpret the implicit semantic information of images. However, current research faces significant challenges, including extracting robust image features, suppressing noisy words, and improving caption coherence. For the first problem, a novel computer vision system is presented that creates a new image feature, MK–KDES-1 (MK–KDES represents Multiple Kernel–Kernel Descriptors), by extracting three KDES features and fusing them with an MKL (Multiple Kernel Learning) model. The MK–KDES-1 feature captures both the textural and shape characteristics of images, which improves the BLEU_1 (BLEU represents Bilingual Evaluation Understudy) scores of captions. For the second problem, a newly designed two-layer TR (Tag Refinement) strategy is integrated into our NLG (Natural Language Generation) algorithm: the words most semantically relevant to an image are summarized to generate N-gram phrases, while noisy words are suppressed by the TR strategy. For the last problem, on the one hand, a popular WE (Word Embeddings) model and a novel metric called PDI (Positive Distance Information) are introduced together to generate N-gram phrases, which are then evaluated by the AWSC (Accumulated Word Semantic Correlation) metric; on the other hand, the phrases are fused into captions by means of ST (Syntactic Trees). Experimental results demonstrate that informative captions with high BLEU_3 scores can be obtained to describe images.




Acknowledgements

I would like to express my warmest gratitude to Yi Yin, my first graduate student, for her valuable work on the writing of the original manuscript. This work was supported by the National Natural Science Foundation of China under Grant Nos. 61762038, 61741108 and 61861016; the Humanity and Social Science Foundation of the Ministry of Education under Grant Nos. 17YJAZH117 and 16YJAZH029; the Natural Science Foundation of Jiangxi under Grant No. 20171BAB202023; the Key Research and Development Plan of Jiangxi Provincial Science and Technology Department under Grant No. 20171BBG70093; the Humanity and Social Science Foundation of Jiangxi Province under Grant No. 16TQ02; and the Humanity and Social Science Foundation of Jiangxi University under Grant Nos. TQ1503 and XW1502.

Author information


Corresponding author

Correspondence to Hongbin Zhang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Communicated by V. Loia.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Zhang, H., Qiu, D., Wu, R. et al. Novel model to integrate word embeddings and syntactic trees for automatic caption generation from images. Soft Comput 24, 1377–1397 (2020). https://doi.org/10.1007/s00500-019-03973-w

