
An Object-Extensible Training Framework for Image Captioning

  • Conference paper
Natural Language Processing and Chinese Computing (NLPCC 2021)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13028)


Abstract

Recent years have witnessed great progress in image captioning based on deep learning. However, most previous methods are limited to the original training dataset, which contains only a fraction of the objects in the real world, and they cannot describe objects outside it. In this paper, we propose an object-extensible training framework that enables a widely used captioning paradigm to describe objects beyond the original training dataset (i.e., extended objects) by automatically generating high-quality training data for these objects. Specifically, we design a general replacement mechanism that replaces an object (an object comprises the object region in the image and the corresponding object word in the caption) in the original training dataset with an extended object to generate new training data. The key challenge in this replacement mechanism is that it must be context-aware to produce meaningful results that comply with common knowledge. We introduce a multi-modal context embedding to ensure that the generated object representation is coherent in the visual context and that the generated caption is smooth and fluent in the linguistic context. Extensive experiments show that our method improves significantly over state-of-the-art methods on the held-out MSCOCO dataset in both automatic and human evaluation.
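The replacement mechanism described above can be pictured as a small data-generation step. The sketch below is not the authors' code; the class and function names, the context_weight blend, and the feature dimensions are illustrative assumptions. It only shows the basic idea: swap an object word in a caption and its region feature in the image with those of an extended object whose images are unpaired with captions, while mixing in some surrounding visual context as a crude stand-in for the paper's context-aware generation.

```python
# Minimal sketch (hypothetical, not the authors' implementation) of the
# object-replacement idea: create a new training example for an extended
# object by swapping the object region feature and the object word.

from dataclasses import dataclass
from typing import List
import numpy as np


@dataclass
class TrainingExample:
    region_feats: np.ndarray   # (num_regions, feat_dim) detector features
    caption_tokens: List[str]  # tokenised caption
    object_region: int         # index of the region showing the object
    object_token: int          # index of the object word in the caption


def replace_object(example: TrainingExample,
                   new_word: str,
                   new_object_feat: np.ndarray,
                   context_weight: float = 0.3) -> TrainingExample:
    """Generate a new example for an extended object.

    `new_object_feat` would come from unpaired images of the extended object
    (no caption required).  Blending with the mean of the remaining regions is
    a placeholder for the context-aware representation in the paper.
    """
    feats = example.region_feats.copy()
    # Visual context = mean feature of all other regions.
    context = np.delete(feats, example.object_region, axis=0).mean(axis=0)
    # Hypothetical context-aware representation of the extended object.
    feats[example.object_region] = (
        (1.0 - context_weight) * new_object_feat + context_weight * context
    )
    # Swap the object word in the caption.
    tokens = list(example.caption_tokens)
    tokens[example.object_token] = new_word
    return TrainingExample(feats, tokens, example.object_region,
                           example.object_token)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    original = TrainingExample(
        region_feats=rng.normal(size=(5, 2048)),
        caption_tokens="a dog sitting on the grass".split(),
        object_region=2,
        object_token=1,          # "dog"
    )
    extended = replace_object(original, "zebra", rng.normal(size=2048))
    print(" ".join(extended.caption_tokens))  # a zebra sitting on the grass
```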

This research is supported by the NSFC-Xinjiang Joint Fund (No. U1903128), NSFC General Technology Joint Fund for Basic Research (No. U1836109, No. U1936206), Natural Science Foundation of Tianjin, China (No. 18ZXZNGX00110, No. 18ZXZNGX00200), and the Fundamental Research Funds for the Central Universities, Nankai University (No. 63211128).


Notes

  1. This image is not paired with a caption and is easy to obtain without manual effort.


Author information


Corresponding author

Correspondence to Ying Zhang.



Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Wu, Y., Zhang, Y., Yuan, X. (2021). An Object-Extensible Training Framework for Image Captioning. In: Wang, L., Feng, Y., Hong, Y., He, R. (eds) Natural Language Processing and Chinese Computing. NLPCC 2021. Lecture Notes in Computer Science (LNAI), vol 13028. Springer, Cham. https://doi.org/10.1007/978-3-030-88480-2_64


  • DOI: https://doi.org/10.1007/978-3-030-88480-2_64

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-88479-6

  • Online ISBN: 978-3-030-88480-2

  • eBook Packages: Computer Science (R0)
