Abstract
The steady momentum of innovations has convincingly demonstrated the high capability of attention mechanisms for sequence-to-sequence learning. Nevertheless, attention over a sequence is typically computed independently at each time step, in either hard or soft mode, resulting in undesired effects such as repeated modeling of the same image regions. In this paper, we introduce a new design that holistically exploits the interdependencies among attention histories and locally emphasizes the strong focus of each attention step for image captioning. Specifically, we present a contextual and selective attention network (CoSA-Net) that memorizes contextual attention and brings out the principal components of each attention distribution. Technically, CoSA-Net writes/updates the attended image region features into memory and reads from memory when measuring attention at the next time step, thereby leveraging contextual knowledge. Only the regions with the top-k highest attention scores are selected, and each selected region feature is individually employed to compute an output distribution; the final output is an attention-weighted mixture of all k distributions. In turn, the attention is upgraded by the posterior distribution conditioned on the output. CoSA-Net is appealing in that it is pluggable into the sentence decoder of any neural captioning model. Extensive experiments on the COCO image captioning dataset demonstrate the superiority of CoSA-Net. More remarkably, integrating CoSA-Net into a one-layer long short-term memory (LSTM) decoder increases the CIDEr-D score from 125.2% to 128.5% on the COCO Karpathy test split. When further endowing a two-layer LSTM decoder with CoSA-Net, the CIDEr-D score is boosted to 129.5%.
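To make the two ideas in the abstract concrete, the following is a minimal PyTorch sketch, not the authors' implementation: (1) a per-region contextual memory that is read before scoring attention and written/updated after attending, and (2) selective attention that keeps only the top-k regions, computes one word distribution per selected region, and mixes them with the attention weights. The posterior upgrade of attention is omitted for brevity, and all names (ContextualSelectiveAttention, mem, k, etc.) are illustrative assumptions rather than the paper's actual interface.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextualSelectiveAttention(nn.Module):
    """Sketch of contextual (memory-augmented) and selective (top-k) attention."""

    def __init__(self, d_model, vocab_size, k=5):
        super().__init__()
        self.k = k
        self.q_proj = nn.Linear(d_model, d_model)
        # Fuses each region feature with its memory slot (attention history).
        self.mem_gate = nn.Linear(2 * d_model, d_model)
        # Maps a (region, hidden-state) pair to a per-region word distribution.
        self.out_head = nn.Linear(2 * d_model, vocab_size)

    def forward(self, h, regions, mem):
        """
        h:       (B, d)     decoder hidden state at the current step
        regions: (B, N, d)  image region features
        mem:     (B, N, d)  contextual attention memory, one slot per region
        """
        # Read from memory: condition each region on its accumulated attention history.
        ctx_regions = torch.tanh(self.mem_gate(torch.cat([regions, mem], dim=-1)))

        # Dot-product attention scores over the context-aware regions.
        scores = torch.einsum('bd,bnd->bn', self.q_proj(h), ctx_regions)

        # Selective attention: keep only the top-k regions and renormalize.
        topv, topi = scores.topk(self.k, dim=-1)                       # (B, k)
        alpha = F.softmax(topv, dim=-1)                                 # (B, k)
        d = regions.size(-1)
        sel = ctx_regions.gather(1, topi.unsqueeze(-1).expand(-1, -1, d))  # (B, k, d)

        # One output distribution per selected region, mixed by the attention weights.
        h_k = h.unsqueeze(1).expand_as(sel)
        logits_k = self.out_head(torch.cat([sel, h_k], dim=-1))        # (B, k, V)
        probs = (alpha.unsqueeze(-1) * F.softmax(logits_k, dim=-1)).sum(dim=1)

        # Write/update memory for the attended regions so later steps can
        # account for what has already been described.
        update = torch.zeros_like(mem)
        update.scatter_(1, topi.unsqueeze(-1).expand_as(sel), alpha.unsqueeze(-1) * sel)
        mem = mem + update
        return probs, mem


# Example usage with random tensors (batch of 2, 36 regions, 512-d features).
att = ContextualSelectiveAttention(d_model=512, vocab_size=10000, k=5)
h = torch.randn(2, 512)
regions = torch.randn(2, 36, 512)
mem = torch.zeros(2, 36, 512)
probs, mem = att(h, regions, mem)   # probs: (2, 10000); mem carries the attention history
```

In this sketch the memory is simply additive per region; the design point it illustrates is that attention at each step is conditioned on what has been attended before, which is what discourages repeated modeling of the same regions.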
Acknowledgements
This work was supported by the National Key Research and Development Program of China (Grant No. 2018AAA0102002) and the National Natural Science Foundation of China (Grant No. 61732007).
Cite this article
Wang, J., Li, Y., Pan, Y. et al. Contextual and selective attention networks for image captioning. Sci. China Inf. Sci. 65, 222103 (2022). https://doi.org/10.1007/s11432-020-3523-6