Abstract
Automatically describing image content with natural language is a fundamental challenge for the computer vision community. Most existing methods generate sentences directly from visual information alone. However, visual information by itself is insufficient to produce fine-grained descriptions of a given image. In this paper, we exploit the fusion of visual information and high-level semantic information for image captioning. We propose a hierarchical deep neural network consisting of a bottom layer and a top layer: the former extracts visual and high-level semantic information from the image and its detected regions, respectively, while the latter integrates the two with an adaptive attention mechanism for caption generation. Experimental results on the MSCOCO dataset show that the proposed method achieves competitive performance against state-of-the-art methods.
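To make the two-layer design concrete, the sketch below is a minimal, illustrative PyTorch implementation, not the authors' code: a decoder LSTM consumes a global (image-level) visual feature together with region-level features supplied by a bottom layer, and fuses them through a visual-sentinel-style adaptive attention gate in the spirit of Lu et al. (2017). The feature extractors, layer dimensions, vocabulary size, and the exact form of the attention are assumptions made for illustration only.

```python
# Minimal sketch (not the authors' code) of a two-layer captioning model:
# a bottom layer supplies a global visual feature plus region-level features,
# and a top LSTM decoder fuses them with a visual-sentinel-style adaptive
# attention gate. Dimensions and vocabulary size are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveAttention(nn.Module):
    """Attend over region features plus a sentinel vector from the decoder."""

    def __init__(self, feat_dim, hid_dim, att_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, att_dim)
        self.hid_proj = nn.Linear(hid_dim, att_dim)
        self.sent_proj = nn.Linear(hid_dim, att_dim)
        self.score = nn.Linear(att_dim, 1)

    def forward(self, regions, hidden, sentinel):
        # regions: (B, R, feat_dim); hidden, sentinel: (B, hid_dim)
        # Assumes feat_dim == hid_dim so the sentinel can be mixed with regions.
        keys = torch.cat([self.feat_proj(regions),
                          self.sent_proj(sentinel).unsqueeze(1)], dim=1)
        e = self.score(torch.tanh(keys + self.hid_proj(hidden).unsqueeze(1)))
        alpha = F.softmax(e.squeeze(-1), dim=1)           # (B, R + 1)
        values = torch.cat([regions, sentinel.unsqueeze(1)], dim=1)
        return (alpha.unsqueeze(-1) * values).sum(dim=1)  # (B, feat_dim)


class HierarchicalCaptioner(nn.Module):
    def __init__(self, vocab_size, feat_dim=512, hid_dim=512, emb_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTMCell(emb_dim + feat_dim, hid_dim)
        self.sentinel_gate = nn.Linear(emb_dim + feat_dim + hid_dim, hid_dim)
        self.attend = AdaptiveAttention(feat_dim, hid_dim, 256)
        self.out = nn.Linear(hid_dim + feat_dim, vocab_size)

    def forward(self, global_feat, region_feats, captions):
        # global_feat: (B, feat_dim) image-level visual feature
        # region_feats: (B, R, feat_dim) features of detected regions
        # captions: (B, T) token ids, used with teacher forcing
        B, T = captions.shape
        h = global_feat.new_zeros(B, self.lstm.hidden_size)
        c = torch.zeros_like(h)
        logits = []
        for t in range(T):
            x = torch.cat([self.embed(captions[:, t]), global_feat], dim=1)
            h_prev = h
            h, c = self.lstm(x, (h, c))
            # Visual sentinel: lets the decoder fall back on language context.
            gate = torch.sigmoid(self.sentinel_gate(torch.cat([x, h_prev], 1)))
            sentinel = gate * torch.tanh(c)
            ctx = self.attend(region_feats, h, sentinel)
            logits.append(self.out(torch.cat([h, ctx], dim=1)))
        return torch.stack(logits, dim=1)  # (B, T, vocab_size)


if __name__ == "__main__":
    model = HierarchicalCaptioner(vocab_size=1000)
    g = torch.randn(2, 512)         # stand-in global visual feature
    r = torch.randn(2, 36, 512)     # stand-in region-level features
    caps = torch.randint(0, 1000, (2, 12))
    print(model(g, r, caps).shape)  # torch.Size([2, 12, 1000])
```

In practice the global and region features would come from a CNN backbone and an object detector (e.g., Faster R-CNN), and the decoder would be trained with cross-entropy against ground-truth captions; those components are omitted here for brevity.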
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (61572356, 61772359, 61872267), the Tianjin New Generation Artificial Intelligence Major Program (18ZXZNGX00150), and the Elite Scholar Program of Tianjin University (2019XR-0001).
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Su, Y., Li, Y., Xu, N. et al. Hierarchical Deep Neural Network for Image Captioning. Neural Process Lett 52, 1057–1067 (2020). https://doi.org/10.1007/s11063-019-09997-5