Abstract
In existing image captioning methods, masked convolutions are commonly used to generate the language description, and the traditional residual network (ResNet) approach applied to masked convolution suffers from the vanishing gradient problem. To address this issue, we propose a new image captioning framework that combines a dense fusion connection (DFC) with an improved stacked attention module. DFC uses the densely connected convolutional network (DenseNet) architecture to connect each layer to every other layer in a feed-forward fashion, and then adopts the ResNet strategy of combining features through summation. The improved stacked attention module captures fine-grained visual information that is highly relevant to word prediction. Finally, we employ the Transformer as the image encoder to fully exploit the attended image representation. Experimental results on the MS-COCO dataset show that the proposed model raises the CIDEr score from 91.2% to 106.1%, outperforming comparable models and verifying its effectiveness.
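To make the DFC design concrete, below is a minimal PyTorch sketch of the idea the abstract describes: DenseNet-style concatenation of all preceding feature maps feeds each masked (causal) convolution layer, and a ResNet-style summation merges the block output back into its input. The class names (CausalConv1d, DenseFusionBlock), the layer count, and the widths are our illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Conv1d):
    """Masked (causal) 1D convolution: each position sees only earlier words."""
    def forward(self, x):
        # Left-pad so the receptive field never covers future positions.
        pad = (self.kernel_size[0] - 1) * self.dilation[0]
        return super().forward(F.pad(x, (pad, 0)))

class DenseFusionBlock(nn.Module):
    """Illustrative sketch of a dense fusion connection (DFC) block.

    DenseNet part: every earlier feature map is concatenated and fed to
    the next layer. ResNet part: the final feature map is combined with
    the block input through summation. Sizes are hypothetical.
    """
    def __init__(self, channels, num_layers=3, kernel_size=3):
        super().__init__()
        # Layer i receives the concatenation of i+1 earlier feature maps.
        self.layers = nn.ModuleList(
            CausalConv1d(channels * (i + 1), channels, kernel_size)
            for i in range(num_layers)
        )

    def forward(self, x):  # x: (batch, channels, seq_len)
        features = [x]
        for layer in self.layers:
            # DenseNet-style: concatenate all preceding outputs as input.
            out = torch.relu(layer(torch.cat(features, dim=1)))
            features.append(out)
        # ResNet-style: combine features through summation.
        return x + out
```

In a caption decoder, such a block would be applied to word embeddings of shape (batch, channels, seq_len); the causal padding ensures each position depends only on earlier words, which is what masked convolution requires.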
Acknowledgements
This work was supported by the Natural Science Foundation of Liaoning Province (No. 2020-MS-080), the Fundamental Research Funds for the Central Universities (No. N2005032), and the Key Projects of the Natural Science Foundation of Liaoning Province (No. 2017012074-301).
Ethics declarations
Conflicts of interest
The authors declare that they have no conflict of interest.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Zhu, H., Wang, R. & Zhang, X. Image Captioning with Dense Fusion Connection and Improved Stacked Attention Module. Neural Process Lett 53, 1101–1118 (2021). https://doi.org/10.1007/s11063-021-10431-y