Fast image captioning using LSTM

Han, Meng; Chen, Wenyu; Moges, Alemu Dagmawi

doi:10.1007/s10586-018-1885-9

Published: 29 March 2018

Volume 22, pages 6143–6155, (2019)

Cluster Computing

Meng Han¹,
Wenyu Chen² &
Alemu Dagmawi Moges²

Abstract

Computer vision and natural language processing have long been among the central challenges in artificial intelligence. In this paper, we explore a generative automatic image annotation model that draws on recent advances on both fronts. Our approach uses a deep convolutional neural network to detect image regions, which are then fed to a recurrent neural network trained to maximize the likelihood of the target sentence description given the image. In our experiments, we found that better accuracy and training were achieved when the image representation from our detection model was coupled with the input word embedding. We also found that most of the information from the last layer of the detection model vanishes when it is fed as the thought vector to our LSTM decoder. This is mainly because the last fully connected layer of the YOLO model encodes only the class probabilities of the detected objects and their bounding boxes, and this information is not rich enough. We trained our model on the COCO benchmark for 60 h, using 64,000 training and 12,800 validation samples, and achieved 23% accuracy. We also observed a significant drop in training speed when we changed the number of hidden units in the LSTM layer from 1470 to 4096.
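The coupling described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation, which is not available in this preview; it only assumes that "coupled with the input word embedding" means concatenating a fixed image feature vector with the embedding of the previous word at every decoder step, and that training maximizes the summed log-probability of the target words. All names (`lstm_step`, `caption_log_likelihood`, `init_params`), the gate ordering, and the toy dimensions are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, x):
    # Plain matrix-vector product on nested lists.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def lstm_step(x, h, c, W, b):
    """One LSTM step. W has 4*H rows of length len(x) + len(h).
    Assumed gate order in W: input, forget, output, candidate."""
    H = len(h)
    z = [a + bi for a, bi in zip(matvec(W, x + h), b)]
    i = [sigmoid(v) for v in z[0:H]]
    f = [sigmoid(v) for v in z[H:2 * H]]
    o = [sigmoid(v) for v in z[2 * H:3 * H]]
    g = [math.tanh(v) for v in z[3 * H:4 * H]]
    c_new = [fi * ci + ii * gi for fi, ci, ii, gi in zip(f, c, i, g)]
    h_new = [oi * math.tanh(ci) for oi, ci in zip(o, c_new)]
    return h_new, c_new

def caption_log_likelihood(image_feat, token_ids, params):
    """Sum of log p(word_t | image, previous words) over one caption."""
    H = len(params["b"]) // 4
    h, c = [0.0] * H, [0.0] * H
    total = 0.0
    for prev, target in zip(token_ids[:-1], token_ids[1:]):
        # Assumption: the image features are concatenated with the
        # previous word's embedding at every step (list concatenation).
        x = params["E"][prev] + image_feat
        h, c = lstm_step(x, h, c, params["W"], params["b"])
        probs = softmax(matvec(params["W_out"], h))
        total += math.log(probs[target])
    return total

def init_params(vocab, embed, feat, hidden, seed=0):
    # Small random weights; toy sizes, not the paper's 1470 or 4096 units.
    rng = random.Random(seed)
    def rand(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {
        "E": rand(vocab, embed),
        "W": rand(4 * hidden, embed + feat + hidden),
        "b": [0.0] * (4 * hidden),
        "W_out": rand(vocab, hidden),
    }

params = init_params(vocab=6, embed=4, feat=5, hidden=3)
image_feat = [0.2] * 5            # stand-in for detector features
caption = [0, 2, 3, 1]            # hypothetical token ids: <start>, ..., <end>
print(caption_log_likelihood(image_feat, caption, params))
```

Training would adjust the parameters to increase this log-likelihood over the caption dataset. The alternative the abstract reports as less effective, feeding the detector output only once as the initial "thought vector", would correspond to initializing `h` from the image features and passing only the word embedding as `x`.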

Author information

Authors and Affiliations

School of Political Science and Public Administration, University of Electronic Science & Technology of China, Chengdu, People’s Republic of China
Meng Han
School of Computer Science and Engineering, University of Electronic Science & Technology of China, Chengdu, People’s Republic of China
Wenyu Chen & Alemu Dagmawi Moges

Corresponding author

Correspondence to Wenyu Chen.

About this article

Cite this article

Han, M., Chen, W. & Moges, A.D. Fast image captioning using LSTM. Cluster Comput 22 (Suppl 3), 6143–6155 (2019). https://doi.org/10.1007/s10586-018-1885-9

Received: 14 November 2017
Revised: 08 January 2018
Accepted: 17 January 2018
Published: 29 March 2018
Issue Date: May 2019
DOI: https://doi.org/10.1007/s10586-018-1885-9
