DOI: 10.1145/3394171.3413753
Research Article

Multimodal Attention with Image Text Spatial Relationship for OCR-Based Image Captioning

Published: 12 October 2020

Abstract

OCR-based image captioning is the task of automatically describing an image by reading and understanding the written text it contains. Compared to conventional image captioning, this task is more challenging, especially when the image contains multiple text tokens and visual objects. The difficulties lie in how to make full use of the knowledge carried by the textual entities to facilitate sentence generation, and in how to predict a text token from the limited information provided by the image. These problems have not yet been fully investigated in existing research. In this paper, we present a novel design, the Multimodal Attention Captioner with OCR Spatial Relationship (dubbed MMA-SR), which manages information from different modalities with a multimodal attention network and explores spatial relationships between text tokens for OCR-based image captioning. Specifically, the representations of text tokens and objects are fed into a three-layer LSTM captioner. Separate attention scores for text tokens and objects are computed by the multimodal attention network. Based on the attended features and the LSTM states, words are selected from the common vocabulary or from the image text by incorporating the learned spatial relationships between text tokens. Extensive experiments on the TextCaps dataset verify the effectiveness of the proposed MMA-SR method. More remarkably, MMA-SR increases the CIDEr-D score from 93.7% to 98.0%.
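
The abstract outlines a concrete per-step decision: attention is computed separately over object features and OCR-token features, and the next word is either drawn from the common vocabulary or copied from the text read in the image, with the copy choice biased by spatial relationships between OCR tokens. The sketch below illustrates that decision in PyTorch. It is a minimal illustration under assumed names and sizes (MultimodalAttentionSketch, the box-offset encoding of spatial relations, the 512-dimensional features, and the 6,000-word vocabulary are all hypothetical); it is not the authors' MMA-SR implementation, and the three-layer LSTM captioner itself is omitted.

```python
# Hypothetical sketch of the per-step word selection described in the abstract.
# All module names, dimensions, and the spatial encoding are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultimodalAttentionSketch(nn.Module):
    """Toy stand-in for the captioner's per-step decision."""

    def __init__(self, hidden=512, feat=512, vocab=6000):
        super().__init__()
        self.att_obj = nn.Linear(hidden + feat, 1)            # attention scores over objects
        self.att_ocr = nn.Linear(hidden + feat, 1)            # attention scores over OCR tokens
        self.to_vocab = nn.Linear(hidden + 2 * feat, vocab)   # scores for common-vocabulary words
        self.to_copy = nn.Linear(hidden + 2 * feat, feat)     # query used to copy an OCR token
        self.spatial = nn.Linear(4, 1)                        # scores box offsets (assumed encoding)

    def attend(self, h, feats, scorer):
        # h: (B, hidden); feats: (B, N, feat) -> attended feature (B, feat)
        q = h.unsqueeze(1).expand(-1, feats.size(1), -1)
        a = F.softmax(scorer(torch.cat([q, feats], dim=-1)).squeeze(-1), dim=-1)
        return torch.bmm(a.unsqueeze(1), feats).squeeze(1)

    def forward(self, h, obj_feats, ocr_feats, ocr_boxes, prev_copy):
        # h:         (B, hidden) decoder LSTM state at the current step
        # obj_feats: (B, No, feat) object features; ocr_feats: (B, Nt, feat) OCR-token features
        # ocr_boxes: (B, Nt, 4) normalized [x1, y1, x2, y2] boxes of the OCR tokens
        # prev_copy: (B, Nt) copy distribution from the previous decoding step
        v_obj = self.attend(h, obj_feats, self.att_obj)
        v_ocr = self.attend(h, ocr_feats, self.att_ocr)
        ctx = torch.cat([h, v_obj, v_ocr], dim=-1)

        vocab_logits = self.to_vocab(ctx)                                              # (B, vocab)
        copy_logits = torch.bmm(ocr_feats, self.to_copy(ctx).unsqueeze(-1)).squeeze(-1)  # (B, Nt)

        # Spatial bias: offset of every OCR box from the (soft) box of the token
        # copied at the previous step, so spatially related tokens are favored.
        prev_box = torch.bmm(prev_copy.unsqueeze(1), ocr_boxes)                        # (B, 1, 4)
        copy_logits = copy_logits + self.spatial(ocr_boxes - prev_box).squeeze(-1)

        # One distribution over "common vocabulary + tokens read from the image".
        return F.softmax(torch.cat([vocab_logits, copy_logits], dim=-1), dim=-1)


# Shapes only; real inputs would come from an object detector, an OCR system,
# and the LSTM captioner at each decoding step.
if __name__ == "__main__":
    model = MultimodalAttentionSketch()
    B, No, Nt = 2, 36, 15
    probs = model(torch.randn(B, 512), torch.randn(B, No, 512), torch.randn(B, Nt, 512),
                  torch.rand(B, Nt, 4), F.softmax(torch.randn(B, Nt), dim=-1))
    print(probs.shape)  # torch.Size([2, 6015]): 6000 vocabulary words + 15 OCR tokens
```

Returning a single softmax over the concatenated vocabulary and OCR-token logits is one common way to realize "select from the common vocabulary or from the image text"; the paper's exact formulation of the learned spatial relationships may differ.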

Supplementary Material

MP4 File (3394171.3413753.mp4)
The presentation video of "Multimodal Attention with Image Text Spatial Relationship for OCR-Based Image Captioning". The video presents the MMA-SR architecture described in the abstract: a multimodal attention network that manages information from different modalities and learned spatial relationships between text tokens, built on a three-layer LSTM captioner that selects words from the common vocabulary or from the image text.





    Information

    Published In

    MM '20: Proceedings of the 28th ACM International Conference on Multimedia
    October 2020
    4889 pages
    ISBN:9781450379885
    DOI:10.1145/3394171
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 12 October 2020


    Author Tags

    1. multimodal attention
    2. ocr spatial relationship
    3. ocr-based image captioning

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China
    • National Key Research and Development Program of China

    Conference

    MM '20

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


    Cited By

    • (2025) Knowledge-Guided Cross-Modal Alignment and Progressive Fusion for Chest X-Ray Report Generation. IEEE Transactions on Multimedia 27, 557-567. DOI: 10.1109/TMM.2024.3521728
    • (2024) OCR Based Deep Learning Approach for Image Captioning. 2024 IEEE International Conference on Computing, Power and Communication Technologies (IC2PCT), 239-244. DOI: 10.1109/IC2PCT60090.2024.10486670
    • (2024) Cross-region feature fusion with geometrical relationship for OCR-based image captioning. Neurocomputing 601, 128197. DOI: 10.1016/j.neucom.2024.128197
    • (2024) Exploring coherence from heterogeneous representations for OCR image captioning. Multimedia Systems 30, 5. DOI: 10.1007/s00530-024-01470-1
    • (2024) Knowledge Mining of Scene Text for Referring Expression Comprehension. Document Analysis and Recognition - ICDAR 2024, 245-262. DOI: 10.1007/978-3-031-70549-6_15
    • (2023) Pix2Struct. Proceedings of the 40th International Conference on Machine Learning, 18893-18912. DOI: 10.5555/3618408.3619188
    • (2023) Text-centric image analysis techniques: a critical review. Journal of Image and Graphics 28, 8, 2253-2275. DOI: 10.11834/jig.220968
    • (2023) Cross-modal Consistency Learning with Fine-grained Fusion Network for Multimodal Fake News Detection. Proceedings of the 5th ACM International Conference on Multimedia in Asia, 1-7. DOI: 10.1145/3595916.3626397
    • (2023) Scene-text Oriented Visual Entailment: Task, Dataset and Solution. Proceedings of the 31st ACM International Conference on Multimedia, 5562-5571. DOI: 10.1145/3581783.3612593
    • (2023) Zero-TextCap: Zero-shot Framework for Text-based Image Captioning. Proceedings of the 31st ACM International Conference on Multimedia, 4949-4957. DOI: 10.1145/3581783.3612571
