DOI: 10.1145/3581783.3612571

Zero-TextCap: Zero-shot Framework for Text-based Image Captioning

Published: 27 October 2023

Abstract

Text-based image captioning is a vital but under-explored task that aims to automatically describe images with captions containing scene text. Recent studies have made encouraging progress, but they still suffer from two issues. First, current models cannot capture or generate scene text in non-Latin scripts, which severely limits the objectivity and completeness of the generated captions. Second, current models tend to describe images in a monotonous, templated style, which greatly limits the diversity of the generated captions. Although these issues can be alleviated through carefully designed annotations, that process is laborious and time-consuming. To address them, we propose a Zero-shot Framework for Text-based Image Captioning (Zero-TextCap). Concretely, we introduce a Hybrid-sampling Masked Language Model (H-MLM) that generates candidate sentences starting from the prompt 'Image of' and iteratively refines them to improve the quality and diversity of captions. To read multi-lingual scene text and model the relationships among the detected tokens, we introduce a robust OCR system. To ensure that the captions generated by the H-MLM contain scene text and are highly relevant to the image, we propose a CLIP-based generation guidance module that inserts OCR tokens and filters candidate sentences. Zero-TextCap is thus able to generate captions containing multi-lingual scene text and to boost caption diversity. Extensive experiments demonstrate the effectiveness of our proposed Zero-TextCap. Our code is available at https://github.com/Gemhuang79/Zero_TextCap.
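
The abstract outlines a generate-and-filter loop: a masked language model iteratively re-fills slots in a sentence that starts with the prompt 'Image of', OCR tokens are injected as extra candidates, and CLIP image-text similarity decides which candidate to keep. The sketch below is a minimal illustration of that loop, not the authors' implementation (see the linked repository for that): it assumes Hugging Face transformers for both the masked language model and CLIP, uses pytesseract as a hypothetical stand-in for the paper's OCR system, and simplifies H-MLM's hybrid sampling to a greedy per-position refinement.

```python
# Minimal sketch of a Zero-TextCap-style loop (assumptions: Hugging Face
# transformers for the MLM and CLIP; pytesseract as a stand-in OCR engine).
import torch
from PIL import Image
from transformers import pipeline, CLIPModel, CLIPProcessor
import pytesseract

fill_mask = pipeline("fill-mask", model="bert-base-uncased", top_k=5)
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_scores(image, captions):
    """CLIP image-text similarity, used to filter candidate sentences."""
    inputs = clip_proc(text=captions, images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        return clip(**inputs).logits_per_image[0]  # one score per caption

def caption(image_path, length=8, rounds=3):
    image = Image.open(image_path).convert("RGB")
    # Read scene text (the paper uses a robust multi-lingual OCR system;
    # pytesseract here is only a placeholder for it).
    ocr_tokens = pytesseract.image_to_string(image).split()

    # Start from the prompt 'Image of' followed by filler slots.
    words = ["image", "of"] + ["something"] * length

    # Gibbs-style refinement: mask one position at a time, propose MLM
    # candidates plus an OCR token, and keep whichever word CLIP prefers.
    for _ in range(rounds):
        for i in range(2, len(words)):
            masked = " ".join(words[:i] + ["[MASK]"] + words[i + 1:])
            proposals = [p["token_str"] for p in fill_mask(masked)]
            proposals += ocr_tokens[:1]  # guidance: insert scene text
            candidates = [" ".join(words[:i] + [w] + words[i + 1:])
                          for w in proposals]
            best = clip_scores(image, candidates).argmax().item()
            words[i] = proposals[best]
    return " ".join(words)

print(caption("street_sign.jpg"))  # hypothetical input image
```

In the paper's actual method, H-MLM's hybrid sampling (rather than this greedy argmax) is what trades caption quality against diversity, and the CLIP-based guidance module both scores whole candidate sentences and governs where OCR tokens are inserted.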

Supplemental Material

MP4 File
Presentation video - short version


Cited By

  • (2024) Depth-aware Image Captioning with Bus Dataset for Visually Impaired Individuals. Proceedings of the 2024 8th International Conference on Computer Science and Artificial Intelligence, 83-90. DOI: 10.1145/3709026.3709052. Online publication date: 6-Dec-2024.
  • (2024) Would Deep Generative Models Amplify Bias in Future Models? 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10833-10843. DOI: 10.1109/CVPR52733.2024.01030. Online publication date: 16-Jun-2024.
  • (2024) Exploring coherence from heterogeneous representations for OCR image captioning. Multimedia Systems 30(5). DOI: 10.1007/s00530-024-01470-1. Online publication date: 6-Sep-2024.


    Information

    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN: 9798400701085
    DOI: 10.1145/3581783


    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. diversity
    2. language bias
    3. text-based image captioning
    4. zero-shot

    Qualifiers

    • Research-article

    Funding Sources

    • Fundamental Research Funds for the Central Universities, SCUT
    • CAAI-Huawei MindSpore Open Fund and the Science and Technology Planning Project of Guangdong Province
    • Guangxi Natural Science Foundation Key Project
    • Guangxi Natural Science Foundation
    • Open Research Fund of Guangxi Key Laboratory of Multimedia Communications and Network Technology
    • Open Research Fund of Key Laboratory of Big Data and Intelligent Robot (SCUT), Ministry of Education
    • CCF-Zhipu AI Large Model Fund
    • National Natural Science Foundation of China
    • Guangxi Scientific and Technological Bases and Talents Special Projects

    Conference

    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa, ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


    Article Metrics

    • Downloads (Last 12 months): 150
    • Downloads (Last 6 weeks): 16

    Reflects downloads up to 05 Mar 2025.

