Steady innovation has convincingly demonstrated the capability of attention mechanisms for sequence-to-sequence learning. Nevertheless, attention is usually computed independently at each step of a sequence, in either hard or soft mode, which leads to undesired effects such as repeated modeling. In this paper, we introduce a new design for image captioning that holistically explores the interdependencies between attention histories and locally emphasizes the strong focus of each attention. Specifically, we present a contextual and selective attention network (CoSA-Net) that memorizes contextual attention and brings out the principal components of each attention. Technically, CoSA-Net writes or updates the attended image region features into memory and reads from memory when measuring attention at the next time step, thereby leveraging contextual knowledge. Only the regions with the top-k highest attention scores are selected, and each selected region feature is individually employed to compute an output distribution. The final output is an attention-weighted mixture of all k distributions. In turn, the attention is then upgraded by the posterior distribution conditioned on the output. CoSA-Net is appealing because it can be plugged into the sentence decoder of any neural captioning model. Extensive experiments on the COCO image captioning dataset demonstrate the superiority of CoSA-Net. More remarkably, integrating CoSA-Net into a one-layer long short-term memory (LSTM) decoder increases the CIDEr-D score from 125.2% to 128.5% on the COCO Karpathy test split. When a two-layer LSTM decoder is further endowed with CoSA-Net, the CIDEr-D score is boosted to 129.5%.
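The selective part of the mechanism described above (top-k selection, one output distribution per selected region, an attention-weighted mixture, and a posterior update of the attention) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function name `selective_attention_step`, the linear projection `W_out`, the softmax renormalization over the selected scores, the greedy word choice, and the use of Bayes' rule for the posterior are all assumptions made for illustration; the memory read/write component is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def selective_attention_step(scores, regions, W_out, k):
    """One illustrative decoding step (hypothetical helper, not from the paper).

    scores:  (R,)   unnormalized attention scores over R image regions
    regions: (R, D) region features
    W_out:   (D, V) assumed linear map from a region feature to vocabulary logits
    """
    # Keep only the k regions with the highest attention scores.
    top_idx = np.argsort(scores)[-k:][::-1]
    # Assumption: renormalize the selected scores into a prior over k regions.
    prior = softmax(scores[top_idx])                          # (k,)
    # Each selected region individually yields an output distribution.
    per_region = softmax(regions[top_idx] @ W_out, axis=-1)   # (k, V)
    # Final output: attention-weighted mixture of the k distributions.
    mixture = prior @ per_region                              # (V,)
    # Assumption: pick the output word greedily.
    word = int(np.argmax(mixture))
    # Assumption: posterior over regions given the word via Bayes' rule,
    # p(region i | word) proportional to prior_i * p_i(word).
    joint = prior * per_region[:, word]
    posterior = joint / joint.sum()
    return top_idx, prior, mixture, word, posterior

# Toy usage with random inputs (sizes are arbitrary).
rng = np.random.default_rng(0)
scores = rng.normal(size=8)
regions = rng.normal(size=(8, 16))
W_out = rng.normal(size=(16, 10))
top_idx, prior, mixture, word, posterior = selective_attention_step(
    scores, regions, W_out, k=3
)
print(top_idx, word, posterior)
```

In this sketch the posterior shifts attention mass toward the selected regions that assigned high probability to the emitted word, which is one plausible reading of "upgraded by the posterior distribution conditioned on the output".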
This work was supported by the National Key Research and Development Program of China (Grant No. 2018AAA0102002) and the National Natural Science Foundation of China (Grant No. 61732007).
Wang, J., Li, Y., Pan, Y. et al. Contextual and selective attention networks for image captioning. Sci. China Inf. Sci. 65, 222103 (2022). https://doi.org/10.1007/s11432-020-3523-6
