
Hierarchical Deep Neural Network for Image Captioning

Published in Neural Processing Letters

Abstract

Automatically describing image content in natural language is a fundamental challenge for the computer vision community. Most existing methods generate sentences directly from visual information alone. However, visual information by itself is often insufficient for producing fine-grained descriptions of a given image. In this paper, we exploit the fusion of visual information and high-level semantic information for image captioning. We propose a hierarchical deep neural network consisting of a bottom layer and a top layer: the former extracts visual and high-level semantic information from the image and from detected regions, respectively, while the latter integrates both with an adaptive attention mechanism to generate the caption. Experimental results on the MSCOCO dataset show competitive performance against state-of-the-art methods.
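The abstract describes the two-layer design only at a high level. As a rough illustration, the following PyTorch-style sketch shows how a top-layer decoder might fuse a global image feature with region-level features through a sentinel-gated (adaptive) attention step; all class names, dimensions, and framework choices are assumptions made for illustration and do not reproduce the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveAttentionDecoder(nn.Module):
    """Sketch of a top-layer decoder: an LSTM that fuses a global image
    feature with region-level features through a sentinel-gated (adaptive)
    attention step. Dimensions and layer names are illustrative only and
    assume region features and the hidden state share the same size."""

    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, feat_dim=512):
        super().__init__()
        assert feat_dim == hidden_dim, "sketch assumes matching dimensions"
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.att_feat = nn.Linear(feat_dim, hidden_dim)      # project region features
        self.att_hidden = nn.Linear(hidden_dim, hidden_dim)  # project decoder state
        self.att_score = nn.Linear(hidden_dim, 1)            # scalar attention score
        self.sentinel_gate = nn.Linear(embed_dim + feat_dim + hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, global_feat, region_feats, captions):
        # global_feat:  (B, feat_dim)    image-level visual feature
        # region_feats: (B, R, feat_dim) features of detected regions (bottom layer)
        # captions:     (B, T)           ground-truth word indices (teacher forcing)
        B, T = captions.shape
        h = global_feat.new_zeros(B, self.lstm.hidden_size)
        c = torch.zeros_like(h)
        logits = []
        for t in range(T):
            w = self.embed(captions[:, t])                         # (B, E)
            x = torch.cat([w, global_feat], dim=1)                 # (B, E + F)
            gate = torch.sigmoid(self.sentinel_gate(torch.cat([x, h], dim=1)))
            h, c = self.lstm(x, (h, c))
            sentinel = gate * torch.tanh(c)                        # (B, H) visual sentinel
            # attend over the region features plus the sentinel slot
            keys = torch.cat([self.att_feat(region_feats),
                              sentinel.unsqueeze(1)], dim=1)       # (B, R + 1, H)
            scores = self.att_score(torch.tanh(keys + self.att_hidden(h).unsqueeze(1)))
            alpha = F.softmax(scores.squeeze(-1), dim=1)           # (B, R + 1)
            values = torch.cat([region_feats, sentinel.unsqueeze(1)], dim=1)
            context = (alpha.unsqueeze(-1) * values).sum(dim=1)    # (B, H) fused context
            logits.append(self.out(h + context))                   # word scores at step t
        return torch.stack(logits, dim=1)                          # (B, T, vocab)


# Toy usage with random tensors; real inputs would come from a CNN backbone
# and a region detector such as Faster R-CNN.
model = AdaptiveAttentionDecoder(vocab_size=1000)
global_feat = torch.randn(2, 512)
region_feats = torch.randn(2, 36, 512)
captions = torch.randint(0, 1000, (2, 12))
print(model(global_feat, region_feats, captions).shape)  # torch.Size([2, 12, 1000])

The sentinel slot lets the decoder down-weight the visual regions when the next word is better predicted from the language context alone, which is the intuition behind adaptive attention.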




Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (61572356, 61772359, 61872267), the Tianjin New Generation Artificial Intelligence Major Program (18ZXZNGX00150), and the Elite Scholar Program of Tianjin University (2019XR-0001).

Author information

Correspondence to Ning Xu or An-An Liu.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Cite this article

Su, Y., Li, Y., Xu, N. et al. Hierarchical Deep Neural Network for Image Captioning. Neural Process Lett 52, 1057–1067 (2020). https://doi.org/10.1007/s11063-019-09997-5
