
Boosting image caption generation with feature fusion module

Published in Multimedia Tools and Applications (2020).

Abstract

Image caption generation is widely regarded as a key problem in vision-to-language research. Previous work commonly uses classification models such as AlexNet, VGG, and ResNet as the encoder to extract image features. However, the captioning task and the classification task place clearly different demands on image features, and this gap has received little attention. In this paper, we propose a novel custom structure, named the feature fusion module (FFM), that makes the features extracted by the encoder better suited to the captioning task. We evaluate the proposed module with two typical models, NIC (Neural Image Caption) and SA (Soft Attention), on two popular benchmarks, MS COCO and Flickr30k. FFM consistently boosts performance and outperforms state-of-the-art methods on five metrics.
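To make the idea concrete, the following is a minimal PyTorch-style sketch of where such a fusion module would sit in a captioning pipeline: a ResNet encoder whose intermediate and final feature maps are combined before being handed to an NIC- or SA-style decoder. The internal design shown here (1x1 reductions, upsampling, and a 3x3 fusion convolution) and all class and parameter names are assumptions for illustration only; the actual FFM architecture is defined in the full paper.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class FeatureFusionModule(nn.Module):
    """Illustrative stand-in for the paper's FFM: fuses two encoder stages.

    This sketch assumes the module reduces both maps to a common width with
    1x1 convolutions, upsamples the coarser map, and fuses them with a 3x3
    convolution; the real FFM may differ.
    """

    def __init__(self, c_low, c_high, c_out):
        super().__init__()
        self.reduce_low = nn.Conv2d(c_low, c_out, kernel_size=1)
        self.reduce_high = nn.Conv2d(c_high, c_out, kernel_size=1)
        self.fuse = nn.Sequential(
            nn.Conv2d(c_out, c_out, kernel_size=3, padding=1),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )

    def forward(self, low, high):
        # Upsample the coarser (deeper) map to the finer map's size, then fuse.
        high = nn.functional.interpolate(
            self.reduce_high(high), size=low.shape[-2:],
            mode='bilinear', align_corners=False)
        return self.fuse(self.reduce_low(low) + high)


class CaptionEncoder(nn.Module):
    """ResNet-50 backbone whose stage-3 and stage-4 outputs are fused."""

    def __init__(self, feat_dim=512):
        super().__init__()
        resnet = models.resnet50()  # load pretrained ImageNet weights in practice
        self.stem = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu,
                                  resnet.maxpool, resnet.layer1, resnet.layer2)
        self.layer3 = resnet.layer3
        self.layer4 = resnet.layer4
        self.ffm = FeatureFusionModule(c_low=1024, c_high=2048, c_out=feat_dim)

    def forward(self, images):
        x = self.stem(images)
        low = self.layer3(x)          # (B, 1024, H/16, W/16)
        high = self.layer4(low)       # (B, 2048, H/32, W/32)
        fused = self.ffm(low, high)   # (B, feat_dim, H/16, W/16)
        # Flatten to a set of region features for an NIC- or SA-style decoder.
        return fused.flatten(2).transpose(1, 2)


# Example: a batch of two 224x224 images yields (2, 196, 512) region features.
# encoder = CaptionEncoder(); feats = encoder(torch.randn(2, 3, 224, 224))
```

In such a setup, the flattened region features returned by CaptionEncoder would replace the raw final-stage CNN features that NIC and Soft Attention decoders normally consume.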

Notes

  1. http://github.com/tylin/coco-caption
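The footnote above points to the coco-caption toolkit, the standard tool for computing captioning metrics such as BLEU, METEOR, ROUGE-L, and CIDEr on MS COCO. A rough usage sketch follows; the file paths are placeholders, and it assumes the toolkit and pycocotools are installed from the linked repository.

```python
# Sketch of scoring generated captions with the coco-caption toolkit
# (http://github.com/tylin/coco-caption). File paths are placeholders.
from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

# Ground-truth annotations plus model outputs in COCO result format:
# [{"image_id": 42, "caption": "a dog runs on the beach"}, ...]
coco = COCO('annotations/captions_val2014.json')
coco_res = coco.loadRes('results/generated_captions.json')

evaluator = COCOEvalCap(coco, coco_res)
# Score only the images that actually have generated captions.
evaluator.params['image_id'] = coco_res.getImgIds()
evaluator.evaluate()

for metric, score in evaluator.eval.items():
    print(f'{metric}: {score:.3f}')
```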

Author information

Corresponding author

Correspondence to Jingsong He.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest. This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Xia, P., He, J. & Yin, J. Boosting image caption generation with feature fusion module. Multimed Tools Appl 79, 24225–24239 (2020). https://doi.org/10.1007/s11042-020-09110-2
