Improving the Performance of Image Captioning Models Trained on Small Datasets

du Plessis, Mikkel; Brink, Willie

doi:10.1007/978-3-030-95070-5_6

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1551)

Included in the following conference series:

Southern African Conference for Artificial Intelligence Research

Abstract

Recent work in image captioning seems to be driven by increasingly large amounts of training data, and requires considerable computing power for training. We propose and investigate a number of adjustments to state-of-the-art approaches, with an aim to train a performant image captioning model in under two hours on a single consumer-level GPU using only a few thousand images. Firstly, we address the issue of sparse object and scene representation in a small dataset by combining visual attention regions at various levels of granularity. Secondly, we suppress semantically unlikely caption candidates through the introduction of language model rescoring during inference. Thirdly, in order to increase vocabulary and expressiveness, we propose an augmentation of the set of training captions through the use of a paraphrase generator. State-of-the-art performance on the Flickr8k test set is achieved, across a number of evaluation metrics. The proposed model also attains competitive test scores compared to existing models trained on a much larger dataset. The findings of this paper can inspire solutions to other vision-and-language tasks where labelled data is scarce.
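
The second adjustment, language model rescoring during inference, can be illustrated with a minimal sketch. The abstract does not state which language model is used or how its score is combined with the captioner's score, so everything below is an assumption made for illustration: a toy add-one-smoothed bigram model stands in for the language model, the two log-probabilities are mixed by a hypothetical linear weight, and the names `train_bigram_lm`, `lm_log_prob`, and `rescore` are invented here.

```python
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Count bigrams over whitespace-tokenised sentences (toy stand-in LM)."""
    contexts, bigrams = Counter(), Counter()
    vocab = {"<s>", "</s>"}
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens)
        contexts.update(tokens[:-1])                  # every token that has a successor
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return contexts, bigrams, len(vocab)


def lm_log_prob(caption, lm):
    """Length-normalised log-probability with add-one smoothing."""
    contexts, bigrams, vocab_size = lm
    tokens = ["<s>"] + caption.split() + ["</s>"]
    total = 0.0
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size))
    return total / (len(tokens) - 1)


def rescore(candidates, lm, weight=0.5):
    """Re-rank (caption, captioner_log_prob) pairs.

    `weight` is an assumed interpolation factor: 0 keeps the captioner's
    ranking, 1 ranks by the language model alone.
    """
    scored = [
        (caption, (1 - weight) * log_prob + weight * lm_log_prob(caption, lm))
        for caption, log_prob in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up corpus and beam-search candidates, for illustration only.
corpus = [
    "a dog runs on the grass",
    "a dog plays in the park",
    "a child runs on the beach",
]
candidates = [
    ("a dog grass the on runs", -0.9),   # captioner's top choice, but implausible
    ("a dog runs on the grass", -1.0),
]
lm = train_bigram_lm(corpus)
best_caption, best_score = rescore(candidates, lm)[0]
print(best_caption)
```

With these made-up scores the captioner alone slightly prefers the scrambled candidate, while the rescored ranking puts the fluent caption first. In practice a pretrained neural language model would take the place of the bigram counts, and the weight would be tuned on held-out data.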

Author information

Authors and Affiliations

Stellenbosch University, Stellenbosch, South Africa
Mikkel du Plessis & Willie Brink

Corresponding author

Correspondence to Willie Brink.

Editor information

Editors and Affiliations

University of KwaZulu-Natal, Durban, South Africa
Edgar Jembere
University of Pretoria, Pretoria, South Africa
Aurona J. Gerber
University of KwaZulu-Natal, Durban, South Africa
Serestina Viriri
University of KwaZulu-Natal, Durban, South Africa
Anban Pillay

Cite this paper

du Plessis, M., Brink, W. (2022). Improving the Performance of Image Captioning Models Trained on Small Datasets. In: Jembere, E., Gerber, A.J., Viriri, S., Pillay, A. (eds) Artificial Intelligence Research. SACAIR 2021. Communications in Computer and Information Science, vol 1551. Springer, Cham. https://doi.org/10.1007/978-3-030-95070-5_6

DOI: https://doi.org/10.1007/978-3-030-95070-5_6
Published: 29 January 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-95069-9
Online ISBN: 978-3-030-95070-5
eBook Packages: Computer Science (R0)
