Abstract
The steady momentum of innovations has convincingly demonstrated the high capability of attention mechanisms for sequence-to-sequence learning. Nevertheless, attention over a sequence is typically computed independently at each time step, in either hard or soft mode, resulting in undesired effects such as repeated modeling of the same image regions. In this paper, we introduce a new design that holistically exploits the interdependencies among attention histories and locally emphasizes the strong focus of each attention step for image captioning. Specifically, we present a contextual and selective attention network (CoSA-Net) that memorizes contextual attention and brings out the principal components of each attention distribution. Technically, CoSA-Net writes/updates the attended image region features into memory and reads from memory when measuring attention at the next time step, thereby leveraging contextual knowledge. Only the regions with the top-k highest attention scores are selected, and each selected region feature is individually employed to compute an output distribution; the final output is an attention-weighted mixture of all k distributions. In turn, the attention is upgraded by the posterior distribution conditioned on the output. CoSA-Net is appealing in that it is pluggable into the sentence decoder of any neural captioning model. Extensive experiments on the COCO image captioning dataset demonstrate the superiority of CoSA-Net. More remarkably, integrating CoSA-Net into a one-layer long short-term memory (LSTM) decoder increases the CIDEr-D score from 125.2% to 128.5% on the COCO Karpathy test split. When further endowing a two-layer LSTM decoder with CoSA-Net, the CIDEr-D score is boosted to 129.5%.
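To make the two ideas in the abstract concrete, the following is a minimal PyTorch sketch, not the authors' implementation: (1) a per-region contextual memory that is read before scoring attention and written/updated after attending, and (2) selective attention that keeps only the top-k regions, computes one word distribution per selected region, and mixes them with the attention weights. The posterior upgrade of attention is omitted for brevity, and all names (ContextualSelectiveAttention, mem, k, etc.) are illustrative assumptions rather than the paper's actual interface.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextualSelectiveAttention(nn.Module):
    """Sketch of contextual (memory-augmented) and selective (top-k) attention."""

    def __init__(self, d_model, vocab_size, k=5):
        super().__init__()
        self.k = k
        self.q_proj = nn.Linear(d_model, d_model)
        # Fuses each region feature with its memory slot (attention history).
        self.mem_gate = nn.Linear(2 * d_model, d_model)
        # Maps a (region, hidden-state) pair to a per-region word distribution.
        self.out_head = nn.Linear(2 * d_model, vocab_size)

    def forward(self, h, regions, mem):
        """
        h:       (B, d)     decoder hidden state at the current step
        regions: (B, N, d)  image region features
        mem:     (B, N, d)  contextual attention memory, one slot per region
        """
        # Read from memory: condition each region on its accumulated attention history.
        ctx_regions = torch.tanh(self.mem_gate(torch.cat([regions, mem], dim=-1)))

        # Dot-product attention scores over the context-aware regions.
        scores = torch.einsum('bd,bnd->bn', self.q_proj(h), ctx_regions)

        # Selective attention: keep only the top-k regions and renormalize.
        topv, topi = scores.topk(self.k, dim=-1)                       # (B, k)
        alpha = F.softmax(topv, dim=-1)                                 # (B, k)
        d = regions.size(-1)
        sel = ctx_regions.gather(1, topi.unsqueeze(-1).expand(-1, -1, d))  # (B, k, d)

        # One output distribution per selected region, mixed by the attention weights.
        h_k = h.unsqueeze(1).expand_as(sel)
        logits_k = self.out_head(torch.cat([sel, h_k], dim=-1))        # (B, k, V)
        probs = (alpha.unsqueeze(-1) * F.softmax(logits_k, dim=-1)).sum(dim=1)

        # Write/update memory for the attended regions so later steps can
        # account for what has already been described.
        update = torch.zeros_like(mem)
        update.scatter_(1, topi.unsqueeze(-1).expand_as(sel), alpha.unsqueeze(-1) * sel)
        mem = mem + update
        return probs, mem


# Example usage with random tensors (batch of 2, 36 regions, 512-d features).
att = ContextualSelectiveAttention(d_model=512, vocab_size=10000, k=5)
h = torch.randn(2, 512)
regions = torch.randn(2, 36, 512)
mem = torch.zeros(2, 36, 512)
probs, mem = att(h, regions, mem)   # probs: (2, 10000); mem carries the attention history
```

In this sketch the memory is simply additive per region; the design point it illustrates is that attention at each step is conditioned on what has been attended before, which is what discourages repeated modeling of the same regions.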
Acknowledgements
This work was supported by the National Key Research and Development Program of China (Grant No. 2018AAA0102002) and the National Natural Science Foundation of China (Grant No. 61732007).
Cite this article
Wang, J., Li, Y., Pan, Y. et al. Contextual and selective attention networks for image captioning. Sci. China Inf. Sci. 65, 222103 (2022). https://doi.org/10.1007/s11432-020-3523-6