Abstract
Automatic image description is a high-level task that combines linguistic and visual information to generate an appropriate caption for an image. In this paper, we propose a method based on a recurrent neural network that synthesizes descriptions in a multimodal space. The contribution of this work lies in generating sentences of variable length and with novel structures; a bidirectional LSTM (Bi-LSTM) network is applied for this purpose. The method uses the inner product as the common space, which reduces the computational cost and improves the results. We evaluate the proposed method on the benchmark Flickr8K and Flickr30K datasets. The results show that the Bi-LSTM model outperforms its unidirectional counterpart.
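Since the abstract names the two key ingredients, a bidirectional LSTM over the caption words and the inner product as the common multimodal space, a minimal sketch may help make them concrete. The PyTorch snippet below is purely illustrative and is not the authors' implementation: the layer sizes, the CNN feature dimension, the mean pooling over LSTM states, and the `BiLSTMCaptionScorer` name are all assumptions made for the example.

```python
# Illustrative sketch only (not the paper's released code): a bidirectional LSTM
# encodes the caption, and an inner product with a projected image feature vector
# serves as the image-sentence similarity in a shared multimodal space.
import torch
import torch.nn as nn


class BiLSTMCaptionScorer(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=256, hidden_dim=256, img_feat_dim=4096):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Bidirectional LSTM reads the caption both left-to-right and right-to-left.
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        # Project CNN image features (e.g. a 4096-d fully connected layer output)
        # into the same 2*hidden_dim space as the sentence representation.
        self.img_proj = nn.Linear(img_feat_dim, 2 * hidden_dim)

    def forward(self, word_ids, img_feats):
        # word_ids: (batch, seq_len) token indices; img_feats: (batch, img_feat_dim)
        emb = self.embed(word_ids)              # (batch, seq_len, embed_dim)
        outputs, _ = self.bilstm(emb)           # (batch, seq_len, 2*hidden_dim)
        sent_vec = outputs.mean(dim=1)          # simple pooling into one sentence vector
        img_vec = self.img_proj(img_feats)      # (batch, 2*hidden_dim)
        # Inner product as the image-sentence score in the common space.
        return (sent_vec * img_vec).sum(dim=1)  # (batch,)


if __name__ == "__main__":
    model = BiLSTMCaptionScorer()
    words = torch.randint(0, 10000, (2, 12))    # two dummy captions of length 12
    feats = torch.randn(2, 4096)                # two dummy image feature vectors
    print(model(words, feats).shape)            # torch.Size([2])
```

Using the inner product directly as the common space, rather than a learned multimodal layer, keeps the scoring step to a single dot product per image-sentence pair, which is the computational saving the abstract refers to.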





Cite this article
Chahkandi, V., Fadaeieslam, M.J. & Yaghmaee, F. Improvement of image description using bidirectional LSTM. Int J Multimed Info Retr 7, 147–155 (2018). https://doi.org/10.1007/s13735-018-0158-y