
Contextual and selective attention networks for image captioning

  • Research Paper
  • Published: Science China Information Sciences

Abstract

The steady momentum of innovations has convincingly demonstrated the high capability of attention mechanisms for sequence-to-sequence learning. Nevertheless, attention is often computed independently at each time step, in either hard or soft mode, resulting in undesired effects such as repeated modeling. In this paper, we introduce a new design that holistically explores the interdependencies between attention histories and locally emphasizes the strong focus of each attention for image captioning. Specifically, we present a contextual and selective attention network (CoSA-Net) that memorizes contextual attention and brings out the principal components of each attention. Technically, CoSA-Net writes/updates the attended image region features into memory and reads from memory when measuring attention at the next time step, thereby leveraging contextual knowledge. Only the regions with the top-k highest attention scores are selected, and each region feature is individually employed to compute an output distribution. The final output is an attention-weighted mixture of all k distributions. In turn, the attention is then updated by the posterior distribution conditioned on the output. CoSA-Net is appealing in that it is pluggable into the sentence decoder of any neural captioning model. Extensive experiments on the COCO image captioning dataset demonstrate the superiority of CoSA-Net. More remarkably, integrating CoSA-Net into a one-layer long short-term memory (LSTM) decoder increases CIDEr-D performance from 125.2% to 128.5% on the COCO Karpathy test split. When a two-layer LSTM decoder is further endowed with CoSA-Net, the CIDEr-D score is boosted to 129.5%.
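To make the abstract's mechanism concrete, the following is a minimal, hedged PyTorch sketch of the two ideas it describes: a memory that stores previously attended region features and is read before scoring attention at the next step, and top-k selective attention whose per-region word distributions are mixed by the attention weights and then used to update the attention posterior. This is an illustration only, not the authors' implementation; the module name CoSAAttentionSketch, the mean-pooled memory read, the expected-likelihood stand-in for the posterior, and all hyper-parameters (memory slots, k) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoSAAttentionSketch(nn.Module):
    """Hypothetical sketch of contextual + selective attention (not the authors' code)."""

    def __init__(self, feat_dim, hid_dim, vocab_size, mem_slots=8, top_k=5):
        super().__init__()
        self.top_k = top_k
        # External memory holding previously attended (contextual) region features.
        self.register_buffer("memory", torch.zeros(mem_slots, feat_dim))
        self.mem_ptr = 0
        # Attention score from [region feature, decoder state, memory read].
        self.score = nn.Linear(feat_dim + hid_dim + feat_dim, 1)
        # Per-region word logits from [region feature, decoder state].
        self.out = nn.Linear(feat_dim + hid_dim, vocab_size)

    def forward(self, regions, h):
        # regions: (N, feat_dim) image region features; h: (hid_dim,) decoder state.
        n = regions.size(0)
        mem_read = self.memory.mean(dim=0)                      # read contextual knowledge
        inp = torch.cat([regions,
                         h.expand(n, -1),
                         mem_read.expand(n, -1)], dim=-1)
        alpha = F.softmax(self.score(inp).squeeze(-1), dim=0)   # attention over all regions

        # Selective attention: keep only the top-k regions and renormalize.
        k = min(self.top_k, n)
        topv, topi = alpha.topk(k)
        topv = topv / topv.sum()

        # One output distribution per selected region, mixed by attention weights.
        per_region = F.softmax(
            self.out(torch.cat([regions[topi], h.expand(k, -1)], dim=-1)), dim=-1)
        word_dist = (topv.unsqueeze(-1) * per_region).sum(dim=0)

        # Posterior update: re-weight attention by how well each region's distribution
        # agrees with the output (a stand-in for the likelihood of the emitted word).
        likelihood = (per_region * word_dist.unsqueeze(0)).sum(dim=-1)
        posterior = topv * likelihood
        posterior = posterior / posterior.sum()

        # Write/update memory with the posterior-attended feature for later time steps.
        attended = (posterior.unsqueeze(-1) * regions[topi]).sum(dim=0)
        with torch.no_grad():
            self.memory[self.mem_ptr] = attended
        self.mem_ptr = (self.mem_ptr + 1) % self.memory.size(0)
        return word_dist, posterior

# Hypothetical usage with 36 region features and a 512-d decoder state:
# att = CoSAAttentionSketch(feat_dim=2048, hid_dim=512, vocab_size=10000)
# word_dist, posterior = att(torch.randn(36, 2048), torch.randn(512))
```

In a full captioning model, such a block would sit between the decoder LSTM state and the word-prediction layer, which is what makes the design pluggable into the one- or two-layer LSTM decoders mentioned above.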



Acknowledgements

This work was supported by the National Key Research and Development Program of China (Grant No. 2018AAA0102002) and the National Natural Science Foundation of China (Grant No. 61732007).

Author information

Corresponding author

Correspondence to Jinhui Tang.


Cite this article

Wang, J., Li, Y., Pan, Y. et al. Contextual and selective attention networks for image captioning. Sci. China Inf. Sci. 65, 222103 (2022). https://doi.org/10.1007/s11432-020-3523-6

