Abstract
Automatic caption generation from images is a prominent research direction in machine learning: it enables computer models that can interpret the implicit semantic information of images. However, current research faces significant challenges, including extracting robust image features, suppressing noisy words, and improving the coherence of captions. For the first problem, a novel computer vision system is presented that creates a new image feature, MK–KDES-1 (MK–KDES: Multiple Kernel–Kernel Descriptors), by extracting three KDES features and fusing them with an MKL (Multiple Kernel Learning) model. The MK–KDES-1 feature captures both the textural and the shape characteristics of images, which improves the BLEU_1 (BLEU: Bilingual Evaluation Understudy) scores of captions. For the second problem, a newly designed two-layer TR (Tag Refinement) strategy is integrated into our NLG (Natural Language Generation) algorithm: the words most semantically relevant to an image are retained to generate N-gram phrases, while noisy words are suppressed by the TR strategy. For the last problem, on the one hand, a popular WE (Word Embeddings) model and a novel metric called PDI (Positive Distance Information) are introduced together to generate N-gram phrases, and the phrases are evaluated by the AWSC (Accumulated Word Semantic Correlation) metric; on the other hand, the phrases are fused into captions by ST (Syntactic Trees). Experimental results demonstrate that informative captions with high BLEU_3 scores can be generated to describe images.
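To make the phrase-scoring idea concrete, the following is a minimal sketch of accumulating word-embedding similarity between a candidate N-gram phrase and an image's tags, in the spirit of the AWSC metric described above. The exact AWSC definition is not given in this abstract, so the scoring rule (each phrase word contributes its best cosine similarity to any tag), the toy 3-d embedding vectors, and the names `awsc` and `EMBEDDINGS` are all illustrative assumptions; a real system would use embeddings from a trained WE model such as word2vec or GloVe.

```python
import math

# Toy 3-d word embeddings (hypothetical values for illustration only;
# a real system would load vectors from a trained word-embedding model).
EMBEDDINGS = {
    "red":   [0.9, 0.1, 0.0],
    "shirt": [0.1, 0.9, 0.2],
    "dress": [0.2, 0.8, 0.3],
    "car":   [0.0, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def awsc(phrase, image_tags):
    """Accumulated word semantic correlation (assumed form): each word in
    the candidate phrase adds its best cosine similarity to any tag."""
    score = 0.0
    for word in phrase:
        sims = [cosine(EMBEDDINGS[word], EMBEDDINGS[tag])
                for tag in image_tags
                if word in EMBEDDINGS and tag in EMBEDDINGS]
        score += max(sims, default=0.0)
    return score

# Rank candidate 2-gram phrases against the image tags.
tags = ["shirt"]
candidates = [("red", "shirt"), ("red", "car")]
best = max(candidates, key=lambda p: awsc(p, tags))
```

Under this scoring rule, the phrase ("red", "shirt") outranks ("red", "car") for an image tagged "shirt", so noisy candidates drift to the bottom of the ranking before caption fusion.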
Acknowledgements
I would like to express my warmest gratitude to Yi Yin, my first graduate student, for her valuable work on the writing of the original manuscript. Our work is supported by the National Natural Science Foundation of China under Grant Nos. 61762038, 61741108 and 61861016, the Humanity and Social Science Foundation of the Ministry of Education under Grant Nos. 17YJAZH117 and 16YJAZH029, the Natural Science Foundation of Jiangxi under Grant No. 20171BAB202023, the Key Research and Development Plan of Jiangxi Provincial Science and Technology Department under Grant No. 20171BBG70093, the Humanity and Social Science Foundation of Jiangxi Province under Grant No. 16TQ02, and the Humanity and Social Science Foundation of Jiangxi University under Grant Nos. TQ1503 and XW1502.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Communicated by V. Loia.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Zhang, H., Qiu, D., Wu, R. et al. Novel model to integrate word embeddings and syntactic trees for automatic caption generation from images. Soft Comput 24, 1377–1397 (2020). https://doi.org/10.1007/s00500-019-03973-w