Abstract
Most existing attention-based image captioning methods focus only on the current visual and textual information at each step when generating the next word, without considering the coherence of the visual information and of the text information itself. We propose a sufficient visual information (SVI) module to supplement the visual information already contained in the network, and a sufficient text information (STI) module that predicts additional future words to supplement the textual information contained in the network. The SVI module embeds the attention values from the past two steps into the current attention, mirroring human visual coherence. The STI module predicts the next three words in a single step and jointly uses their probabilities for inference. Finally, we combine these two modules into an image captioning model based on sufficient visual information and text information (SVITI), which further integrates the existing visual information and future textual information in the network, thereby improving captioning performance. Applied to classic image captioning algorithms, these methods achieve significant performance improvements over state-of-the-art methods on the MS COCO dataset.
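As a rough illustration of the two mechanisms described in the abstract, the following minimal PyTorch-style sketch shows how attended visual features from the two previous steps might be fused with the current attention (SVI), and how three future-word distributions might be predicted from one hidden state and combined for inference (STI). The module names, layer shapes, and the fusion/weighting scheme are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SVIAttentionFusion(nn.Module):
    """Sketch of the SVI idea: fuse the attended visual features from the
    two previous decoding steps into the current attended feature."""
    def __init__(self, feat_dim):
        super().__init__()
        # Learnable fusion of current / t-1 / t-2 attended features (assumed design).
        self.fuse = nn.Linear(3 * feat_dim, feat_dim)

    def forward(self, att_t, att_t1, att_t2):
        # att_t:  attended visual feature at the current step  (B, D)
        # att_t1: attended visual feature at step t-1          (B, D)
        # att_t2: attended visual feature at step t-2          (B, D)
        fused = torch.cat([att_t, att_t1, att_t2], dim=-1)
        return torch.tanh(self.fuse(fused))

class STIHead(nn.Module):
    """Sketch of the STI idea: predict the next three words from one
    hidden state and combine their probabilities at inference time."""
    def __init__(self, hidden_dim, vocab_size):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(hidden_dim, vocab_size) for _ in range(3))

    def forward(self, h_t):
        # Log-probabilities over the vocabulary for words t+1, t+2, t+3.
        return [F.log_softmax(head(h_t), dim=-1) for head in self.heads]

    def joint_score(self, h_t, weights=(1.0, 0.5, 0.25)):
        # One possible way to combine the three distributions when scoring
        # the next word; the weighting scheme here is an assumption.
        logps = self.forward(h_t)
        return sum(w * lp for w, lp in zip(weights, logps))
```

In this sketch, the fused attention vector would replace the single-step attended feature fed to the decoder, and the joint score would replace the single next-word distribution during beam search; how the paper actually weights the future-word probabilities is not specified in the abstract.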