Abstract
Most encoder-decoder architectures generate the image description sentence with a recurrent neural network (RNN) decoder. However, an RNN decoder trained by Back-Propagation Through Time (BPTT) is inherently slow to train and suffers from the vanishing gradient problem. To overcome these difficulties, we propose a novel Parallelised Attentive Image Captioning model (PAIC) that decodes natural sentences purely with attention mechanisms, without using RNNs. At each decoding step, our model precisely localises different areas of the image using a well-defined spatial attention module, while capturing the word sequence with the well-attested multi-head self-attention model. In contrast to RNNs, the proposed PAIC efficiently exploits the parallel computation of GPU hardware during training and further facilitates gradient propagation. Extensive experiments on MS-COCO demonstrate that PAIC significantly reduces training time while achieving performance competitive with conventional RNN-based models.
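The two ingredients the abstract names, causally masked self-attention over the word sequence and spatial attention over image regions, can be sketched as follows. This is a minimal single-head numpy illustration of the general technique, not the authors' implementation: the function names, dimensions, and the single-head simplification are assumptions for clarity. The causal mask is what lets all decoding positions be computed in parallel during training, since no position depends on a recurrent state from the previous step.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(words):
    """Causally masked self-attention over word embeddings.

    words: (T, d) array; position t may only attend to positions <= t,
    so every position can be computed in parallel during training.
    (Single head, no learned projections, for brevity.)
    """
    T, d = words.shape
    scores = words @ words.T / np.sqrt(d)            # (T, T) similarity
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # future positions
    scores[mask] = -1e9                               # block attention to the future
    return softmax(scores, axis=-1) @ words           # (T, d) contexts

def spatial_attention(query, regions):
    """Attend over image region features given a decoder query.

    query: (d,), regions: (R, d); returns the attended visual vector
    and the attention weights over regions.
    """
    scores = regions @ query / np.sqrt(regions.shape[1])  # (R,)
    weights = softmax(scores)
    return weights @ regions, weights

rng = np.random.default_rng(0)
T, R, d = 5, 7, 16
words = rng.standard_normal((T, d))       # embedded caption prefix
regions = rng.standard_normal((R, d))     # CNN region features

ctx = masked_self_attention(words)             # all T positions at once
visual, w = spatial_attention(ctx[-1], regions)  # ground the next word visually
```

Because the mask already hides the future, changing a later word cannot affect the context computed for earlier positions, which is exactly the property that replaces the sequential dependency of an RNN decoder.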
Acknowledgement
This work is partially supported by ARC DP190102353 and ARC DP170103954.
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this paper
Wang, Z., Huang, Z., Luo, Y. (2020). PAIC: Parallelised Attentive Image Captioning. In: Borovica-Gajic, R., Qi, J., Wang, W. (eds) Databases Theory and Applications. ADC 2020. Lecture Notes in Computer Science(), vol 12008. Springer, Cham. https://doi.org/10.1007/978-3-030-39469-1_2
Print ISBN: 978-3-030-39468-4
Online ISBN: 978-3-030-39469-1