
Hierarchical Deep Neural Network for Image Captioning

Published in Neural Processing Letters

Abstract

Automatically describing image content in natural language is a fundamental challenge for the computer vision community. Most existing methods generate sentences directly from visual information alone. However, visual information by itself is often insufficient for producing fine-grained descriptions of a given image. In this paper, we exploit the fusion of visual information and high-level semantic information for image captioning. We propose a hierarchical deep neural network consisting of a bottom layer and a top layer: the former extracts visual and high-level semantic information from the image and from detected regions, respectively, while the latter integrates both with an adaptive attention mechanism to generate the caption. Experimental results on the MSCOCO dataset show competitive performance against state-of-the-art methods.
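The abstract describes the two-layer design only at a high level. As a rough illustration, the following PyTorch-style sketch shows how a top-layer decoder might fuse a global image feature with region-level features through a sentinel-gated (adaptive) attention step; all class names, dimensions, and framework choices are assumptions made for illustration and do not reproduce the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveAttentionDecoder(nn.Module):
    """Sketch of a top-layer decoder: an LSTM that fuses a global image
    feature with region-level features through a sentinel-gated (adaptive)
    attention step. Dimensions and layer names are illustrative only and
    assume region features and the hidden state share the same size."""

    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, feat_dim=512):
        super().__init__()
        assert feat_dim == hidden_dim, "sketch assumes matching dimensions"
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.att_feat = nn.Linear(feat_dim, hidden_dim)      # project region features
        self.att_hidden = nn.Linear(hidden_dim, hidden_dim)  # project decoder state
        self.att_score = nn.Linear(hidden_dim, 1)            # scalar attention score
        self.sentinel_gate = nn.Linear(embed_dim + feat_dim + hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, global_feat, region_feats, captions):
        # global_feat:  (B, feat_dim)    image-level visual feature
        # region_feats: (B, R, feat_dim) features of detected regions (bottom layer)
        # captions:     (B, T)           ground-truth word indices (teacher forcing)
        B, T = captions.shape
        h = global_feat.new_zeros(B, self.lstm.hidden_size)
        c = torch.zeros_like(h)
        logits = []
        for t in range(T):
            w = self.embed(captions[:, t])                         # (B, E)
            x = torch.cat([w, global_feat], dim=1)                 # (B, E + F)
            gate = torch.sigmoid(self.sentinel_gate(torch.cat([x, h], dim=1)))
            h, c = self.lstm(x, (h, c))
            sentinel = gate * torch.tanh(c)                        # (B, H) visual sentinel
            # attend over the region features plus the sentinel slot
            keys = torch.cat([self.att_feat(region_feats),
                              sentinel.unsqueeze(1)], dim=1)       # (B, R + 1, H)
            scores = self.att_score(torch.tanh(keys + self.att_hidden(h).unsqueeze(1)))
            alpha = F.softmax(scores.squeeze(-1), dim=1)           # (B, R + 1)
            values = torch.cat([region_feats, sentinel.unsqueeze(1)], dim=1)
            context = (alpha.unsqueeze(-1) * values).sum(dim=1)    # (B, H) fused context
            logits.append(self.out(h + context))                   # word scores at step t
        return torch.stack(logits, dim=1)                          # (B, T, vocab)


# Toy usage with random tensors; real inputs would come from a CNN backbone
# and a region detector such as Faster R-CNN.
model = AdaptiveAttentionDecoder(vocab_size=1000)
global_feat = torch.randn(2, 512)
region_feats = torch.randn(2, 36, 512)
captions = torch.randint(0, 1000, (2, 12))
print(model(global_feat, region_feats, captions).shape)  # torch.Size([2, 12, 1000])

The sentinel slot lets the decoder down-weight the visual regions when the next word is better predicted from the language context alone, which is the intuition behind adaptive attention.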




Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (61572356, 61772359, 61872267), the Tianjin New Generation Artificial Intelligence Major Program (18ZXZNGX00150), and the Elite Scholar Program of Tianjin University (2019XR-0001).

Author information

Correspondence to Ning Xu or An-An Liu.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Cite this article

Su, Y., Li, Y., Xu, N. et al. Hierarchical Deep Neural Network for Image Captioning. Neural Process Lett 52, 1057–1067 (2020). https://doi.org/10.1007/s11063-019-09997-5
