
Boosting image caption generation with feature fusion module

Published in Multimedia Tools and Applications (2020).

Abstract

Image caption generation is widely regarded as a key problem in vision-to-language research. Previous work commonly uses classification models such as AlexNet, VGG, and ResNet as the encoder to extract image features. However, the captioning task and the classification task place clearly different demands on image features, and this gap has received little attention. In this paper, we propose a novel custom structure, named the feature fusion module (FFM), that makes the features extracted by the encoder better suited to the captioning task. We evaluate the proposed module with two typical models, NIC (Neural Image Caption) and SA (Soft Attention), on two popular benchmarks, MS COCO and Flickr30k. FFM consistently boosts performance and outperforms state-of-the-art methods on five metrics.
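To make the idea concrete, the following is a minimal PyTorch-style sketch of where such a fusion module would sit in a captioning pipeline: a ResNet encoder whose intermediate and final feature maps are combined before being handed to an NIC- or SA-style decoder. The internal design shown here (1x1 reductions, upsampling, and a 3x3 fusion convolution) and all class and parameter names are assumptions for illustration only; the actual FFM architecture is defined in the full paper.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class FeatureFusionModule(nn.Module):
    """Illustrative stand-in for the paper's FFM: fuses two encoder stages.

    This sketch assumes the module reduces both maps to a common width with
    1x1 convolutions, upsamples the coarser map, and fuses them with a 3x3
    convolution; the real FFM may differ.
    """

    def __init__(self, c_low, c_high, c_out):
        super().__init__()
        self.reduce_low = nn.Conv2d(c_low, c_out, kernel_size=1)
        self.reduce_high = nn.Conv2d(c_high, c_out, kernel_size=1)
        self.fuse = nn.Sequential(
            nn.Conv2d(c_out, c_out, kernel_size=3, padding=1),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )

    def forward(self, low, high):
        # Upsample the coarser (deeper) map to the finer map's size, then fuse.
        high = nn.functional.interpolate(
            self.reduce_high(high), size=low.shape[-2:],
            mode='bilinear', align_corners=False)
        return self.fuse(self.reduce_low(low) + high)


class CaptionEncoder(nn.Module):
    """ResNet-50 backbone whose stage-3 and stage-4 outputs are fused."""

    def __init__(self, feat_dim=512):
        super().__init__()
        resnet = models.resnet50()  # load pretrained ImageNet weights in practice
        self.stem = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu,
                                  resnet.maxpool, resnet.layer1, resnet.layer2)
        self.layer3 = resnet.layer3
        self.layer4 = resnet.layer4
        self.ffm = FeatureFusionModule(c_low=1024, c_high=2048, c_out=feat_dim)

    def forward(self, images):
        x = self.stem(images)
        low = self.layer3(x)          # (B, 1024, H/16, W/16)
        high = self.layer4(low)       # (B, 2048, H/32, W/32)
        fused = self.ffm(low, high)   # (B, feat_dim, H/16, W/16)
        # Flatten to a set of region features for an NIC- or SA-style decoder.
        return fused.flatten(2).transpose(1, 2)


# Example: a batch of two 224x224 images yields (2, 196, 512) region features.
# encoder = CaptionEncoder(); feats = encoder(torch.randn(2, 3, 224, 224))
```

In such a setup, the flattened region features returned by CaptionEncoder would replace the raw final-stage CNN features that NIC and Soft Attention decoders normally consume.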

Notes

  1. http://github.com/tylin/coco-caption
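The footnote above points to the coco-caption toolkit, the standard tool for computing captioning metrics such as BLEU, METEOR, ROUGE-L, and CIDEr on MS COCO. A rough usage sketch follows; the file paths are placeholders, and it assumes the toolkit and pycocotools are installed from the linked repository.

```python
# Sketch of scoring generated captions with the coco-caption toolkit
# (http://github.com/tylin/coco-caption). File paths are placeholders.
from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

# Ground-truth annotations plus model outputs in COCO result format:
# [{"image_id": 42, "caption": "a dog runs on the beach"}, ...]
coco = COCO('annotations/captions_val2014.json')
coco_res = coco.loadRes('results/generated_captions.json')

evaluator = COCOEvalCap(coco, coco_res)
# Score only the images that actually have generated captions.
evaluator.params['image_id'] = coco_res.getImgIds()
evaluator.evaluate()

for metric, score in evaluator.eval.items():
    print(f'{metric}: {score:.3f}')
```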

Author information

Corresponding author

Correspondence to Jingsong He.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest. This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Xia, P., He, J. & Yin, J. Boosting image caption generation with feature fusion module. Multimed Tools Appl 79, 24225–24239 (2020). https://doi.org/10.1007/s11042-020-09110-2
