DOI: 10.1145/3581783.3612571

Zero-TextCap: Zero-shot Framework for Text-based Image Captioning

Published: 27 October 2023

Abstract

Text-based image captioning is a vital but under-explored task that aims to automatically describe images with captions containing scene text. Recent studies have made encouraging progress, but they still suffer from two issues. First, current models cannot capture or generate scene text in non-Latin scripts, which severely limits the objectivity and completeness of the generated captions. Second, current models tend to describe images in a monotonous, templated style, which greatly limits the diversity of the generated captions. Although these issues can be alleviated through carefully designed annotations, that process is laborious and time-consuming. To address them, we propose a Zero-shot Framework for Text-based Image Captioning (Zero-TextCap). Concretely, we introduce a Hybrid-sampling Masked Language Model (H-MLM) that generates candidate sentences starting from the prompt 'Image of' and iteratively refines them to improve the quality and diversity of captions. To read multi-lingual scene text and model the relationships among the detected tokens, we introduce a robust OCR system. To ensure that the captions generated by the H-MLM contain scene text and are highly relevant to the image, we propose a CLIP-based generation guidance module that inserts OCR tokens and filters candidate sentences. Zero-TextCap is thus able to generate captions containing multi-lingual scene text and to boost caption diversity. Extensive experiments demonstrate the effectiveness of our proposed Zero-TextCap. Our code is available at https://github.com/Gemhuang79/Zero_TextCap.
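
The abstract outlines a generate-and-filter loop: a masked language model iteratively re-fills slots in a sentence that starts with the prompt 'Image of', OCR tokens are injected as extra candidates, and CLIP image-text similarity decides which candidate to keep. The sketch below is a minimal illustration of that loop, not the authors' implementation (see the linked repository for that): it assumes Hugging Face transformers for both the masked language model and CLIP, uses pytesseract as a hypothetical stand-in for the paper's OCR system, and simplifies H-MLM's hybrid sampling to a greedy per-position refinement.

```python
# Minimal sketch of a Zero-TextCap-style loop (assumptions: Hugging Face
# transformers for the MLM and CLIP; pytesseract as a stand-in OCR engine).
import torch
from PIL import Image
from transformers import pipeline, CLIPModel, CLIPProcessor
import pytesseract

fill_mask = pipeline("fill-mask", model="bert-base-uncased", top_k=5)
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_scores(image, captions):
    """CLIP image-text similarity, used to filter candidate sentences."""
    inputs = clip_proc(text=captions, images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        return clip(**inputs).logits_per_image[0]  # one score per caption

def caption(image_path, length=8, rounds=3):
    image = Image.open(image_path).convert("RGB")
    # Read scene text (the paper uses a robust multi-lingual OCR system;
    # pytesseract here is only a placeholder for it).
    ocr_tokens = pytesseract.image_to_string(image).split()

    # Start from the prompt 'Image of' followed by filler slots.
    words = ["image", "of"] + ["something"] * length

    # Gibbs-style refinement: mask one position at a time, propose MLM
    # candidates plus an OCR token, and keep whichever word CLIP prefers.
    for _ in range(rounds):
        for i in range(2, len(words)):
            masked = " ".join(words[:i] + ["[MASK]"] + words[i + 1:])
            proposals = [p["token_str"] for p in fill_mask(masked)]
            proposals += ocr_tokens[:1]  # guidance: insert scene text
            candidates = [" ".join(words[:i] + [w] + words[i + 1:])
                          for w in proposals]
            best = clip_scores(image, candidates).argmax().item()
            words[i] = proposals[best]
    return " ".join(words)

print(caption("street_sign.jpg"))  # hypothetical input image
```

In the paper's actual method, H-MLM's hybrid sampling (rather than this greedy argmax) is what trades caption quality against diversity, and the CLIP-based guidance module both scores whole candidate sentences and governs where OCR tokens are inserted.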

Supplemental Material

MP4 File
Presentation video - short version


Cited By

  • (2024) Depth-aware Image Captioning with Bus Dataset for Visually Impaired Individuals. Proceedings of the 2024 8th International Conference on Computer Science and Artificial Intelligence, 83-90. DOI: 10.1145/3709026.3709052. Online publication date: 6-Dec-2024.
  • (2024) Would Deep Generative Models Amplify Bias in Future Models? 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10833-10843. DOI: 10.1109/CVPR52733.2024.01030. Online publication date: 16-Jun-2024.
  • (2024) Exploring coherence from heterogeneous representations for OCR image captioning. Multimedia Systems 30(5). DOI: 10.1007/s00530-024-01470-1. Online publication date: 6-Sep-2024.


    Information

    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN: 9798400701085
    DOI: 10.1145/3581783


    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. diversity
    2. language bias
    3. text-based image captioning
    4. zero-shot

    Qualifiers

    • Research-article

    Funding Sources

    • Fundamental Research Funds for the Central Universities, SCUT
    • CAAI-Huawei MindSpore Open Fund and the Science and Technology Planning Project of Guangdong Province
    • Guangxi Natural Science Foundation Key Project
    • Guangxi Natural Science Foundation
    • Open Research Fund of Guangxi Key Laboratory of Multimedia Communications and Network Technology
    • Open Research Fund of Key Laboratory of Big Data and Intelligent Robot (SCUT), Ministry of Education
    • CCF-Zhipu AI Large Model Fund
    • National Natural Science Foundation of China
    • Guangxi Scientific and Technological Bases and Talents Special Projects

    Conference

    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa, ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


    Article Metrics

    • Downloads (Last 12 months): 150
    • Downloads (Last 6 weeks): 16

    Reflects downloads up to 05 Mar 2025.

