Abstract
In existing image captioning methods, masked convolutions are commonly used to generate the language description, and the traditional residual network (ResNet) approach applied to masked convolution suffers from the vanishing gradient problem. To address this issue, we propose a new image captioning framework that combines a dense fusion connection (DFC) with an improved stacked attention module. DFC uses the densely connected convolutional network (DenseNet) architecture to connect each layer to every other layer in a feed-forward fashion, and then adopts the ResNet strategy of combining features through summation. The improved stacked attention module captures fine-grained visual information that is highly relevant to word prediction. Finally, we employ the Transformer as the image encoder to fully exploit the attended image representation. Experimental results on the MS-COCO dataset show that the proposed model raises the CIDEr score from 91.2% to 106.1%, outperforming comparable models and verifying its effectiveness.
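To make the DFC design concrete, below is a minimal PyTorch sketch of the idea the abstract describes: DenseNet-style concatenation of all preceding feature maps feeds each masked (causal) convolution layer, and a ResNet-style summation merges the block output back into its input. The class names (CausalConv1d, DenseFusionBlock), the layer count, and the widths are our illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Conv1d):
    """Masked (causal) 1D convolution: each position sees only earlier words."""
    def forward(self, x):
        # Left-pad so the receptive field never covers future positions.
        pad = (self.kernel_size[0] - 1) * self.dilation[0]
        return super().forward(F.pad(x, (pad, 0)))

class DenseFusionBlock(nn.Module):
    """Illustrative sketch of a dense fusion connection (DFC) block.

    DenseNet part: every earlier feature map is concatenated and fed to
    the next layer. ResNet part: the final feature map is combined with
    the block input through summation. Sizes are hypothetical.
    """
    def __init__(self, channels, num_layers=3, kernel_size=3):
        super().__init__()
        # Layer i receives the concatenation of i+1 earlier feature maps.
        self.layers = nn.ModuleList(
            CausalConv1d(channels * (i + 1), channels, kernel_size)
            for i in range(num_layers)
        )

    def forward(self, x):  # x: (batch, channels, seq_len)
        features = [x]
        for layer in self.layers:
            # DenseNet-style: concatenate all preceding outputs as input.
            out = torch.relu(layer(torch.cat(features, dim=1)))
            features.append(out)
        # ResNet-style: combine features through summation.
        return x + out
```

In a caption decoder, such a block would be applied to word embeddings of shape (batch, channels, seq_len); the causal padding ensures each position depends only on earlier words, which is what masked convolution requires.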
Acknowledgements
This work was supported by the Natural Science Foundation of Liaoning Province (No. 2020-MS-080), the Fundamental Research Funds for the Central Universities (No. N2005032), and the Key Projects of the Natural Science Foundation of Liaoning Province (No. 2017012074-301).
Ethics declarations
Conflicts of interest
The authors declare that they have no conflict of interest.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Zhu, H., Wang, R. & Zhang, X. Image Captioning with Dense Fusion Connection and Improved Stacked Attention Module. Neural Process Lett 53, 1101–1118 (2021). https://doi.org/10.1007/s11063-021-10431-y