Abstract
Most encoder-decoder architectures generate the image description sentence with a recurrent neural network (RNN) decoder. However, an RNN decoder trained by Back-Propagation Through Time (BPTT) is inherently slow to train and suffers from the vanishing gradient problem. To overcome these difficulties, we propose a novel Parallelised Attentive Image Captioning model (PAIC) that decodes natural sentences purely with attention mechanisms, without using RNNs. At each decoding step, our model precisely localises different areas of the image using a well-defined spatial attention module, while capturing the word sequence with the well-attested multi-head self-attention model. In contrast to RNNs, the proposed PAIC efficiently exploits the parallel computation of GPU hardware during training and further facilitates gradient propagation. Extensive experiments on MS-COCO demonstrate that PAIC significantly reduces training time while achieving performance competitive with conventional RNN-based models.
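The two ingredients the abstract names, causally masked self-attention over the word sequence and spatial attention over image regions, can be sketched as follows. This is a minimal single-head numpy illustration of the general technique, not the authors' implementation: the function names, dimensions, and the single-head simplification are assumptions for clarity. The causal mask is what lets all decoding positions be computed in parallel during training, since no position depends on a recurrent state from the previous step.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(words):
    """Causally masked self-attention over word embeddings.

    words: (T, d) array; position t may only attend to positions <= t,
    so every position can be computed in parallel during training.
    (Single head, no learned projections, for brevity.)
    """
    T, d = words.shape
    scores = words @ words.T / np.sqrt(d)            # (T, T) similarity
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # future positions
    scores[mask] = -1e9                               # block attention to the future
    return softmax(scores, axis=-1) @ words           # (T, d) contexts

def spatial_attention(query, regions):
    """Attend over image region features given a decoder query.

    query: (d,), regions: (R, d); returns the attended visual vector
    and the attention weights over regions.
    """
    scores = regions @ query / np.sqrt(regions.shape[1])  # (R,)
    weights = softmax(scores)
    return weights @ regions, weights

rng = np.random.default_rng(0)
T, R, d = 5, 7, 16
words = rng.standard_normal((T, d))       # embedded caption prefix
regions = rng.standard_normal((R, d))     # CNN region features

ctx = masked_self_attention(words)             # all T positions at once
visual, w = spatial_attention(ctx[-1], regions)  # ground the next word visually
```

Because the mask already hides the future, changing a later word cannot affect the context computed for earlier positions, which is exactly the property that replaces the sequential dependency of an RNN decoder.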
Acknowledgement
This work is partially supported by ARC DP190102353 and ARC DP170103954.
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this paper
Wang, Z., Huang, Z., Luo, Y. (2020). PAIC: Parallelised Attentive Image Captioning. In: Borovica-Gajic, R., Qi, J., Wang, W. (eds) Databases Theory and Applications. ADC 2020. Lecture Notes in Computer Science(), vol 12008. Springer, Cham. https://doi.org/10.1007/978-3-030-39469-1_2
Print ISBN: 978-3-030-39468-4
Online ISBN: 978-3-030-39469-1