Abstract
This paper presents a simple yet effective approach to weakly supervised text localization in end-to-end visual document understanding (VDU) models. VDU systems have traditionally combined off-the-shelf OCR engines with natural language understanding models. To simplify this pipeline and improve efficiency, recent research has turned to OCR-free document understanding transformers; however, these models do not report the location of the recognized text to the user. To address this limitation, we propose a text localization method for OCR-free models that requires no additional supervision, such as bounding box annotations. The method exploits properties of the attention mechanism and locates text regions with accuracy competitive with supervised methods. Because it can be applied to most existing OCR-free models with little effort, it is an attractive solution for practitioners in the field. We validate the method through experiments on document parsing benchmarks, and the results demonstrate that it generalizes to various camera-captured document images, such as receipts and business cards. The implementation will be available at https://github.com/clovaai/donut.
G. Kim and S. Yokoo contributed equally.
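The abstract states only that the method builds on properties of the attention mechanism. As a rough illustration of the general idea, rather than the authors' actual procedure, the sketch below converts one decoded token's cross-attention weights over encoder image patches into a pixel bounding box. The function name `token_bbox`, the head/layer-averaged `attn` vector, the patch-grid shape, and the threshold are all assumptions for illustration; the paper's aggregation may differ.

```python
# Minimal sketch (assumed, not the authors' exact method): localize a decoded
# token by thresholding its decoder cross-attention over encoder image patches.
import numpy as np

def token_bbox(attn, grid_h, grid_w, img_h, img_w, thresh=0.5):
    # `attn` is assumed to be one token's cross-attention over patches,
    # already averaged over heads/layers, with one entry per encoder patch.
    amap = attn.reshape(grid_h, grid_w)
    # Min-max normalize so a relative threshold is meaningful.
    amap = (amap - amap.min()) / (amap.max() - amap.min() + 1e-8)
    # Keep only the patches the token attends to strongly.
    ys, xs = np.nonzero(amap >= thresh)
    if len(xs) == 0:
        return None  # no confident image support for this token
    # Map patch-grid coordinates back to pixel coordinates.
    sy, sx = img_h / grid_h, img_w / grid_w
    return (xs.min() * sx, ys.min() * sy,
            (xs.max() + 1) * sx, (ys.max() + 1) * sy)

# Example with dummy data: a 32x32 patch grid over a 1024x1024 image.
attn = np.random.rand(32 * 32)
print(token_bbox(attn, 32, 32, 1024, 1024, thresh=0.9))
```

In practice one would also group neighboring above-threshold patches (e.g., by connected-component labeling) so that tokens attending to several separate regions yield one box per region rather than one box spanning them all.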
Cite this paper
Kim, G., Yokoo, S., Seo, S., Osanai, A., Okamoto, Y., Baek, Y. (2023). On Text Localization in End-to-End OCR-Free Document Understanding Transformer Without Text Localization Supervision. In: Coustaty, M., Fornés, A. (eds) Document Analysis and Recognition – ICDAR 2023 Workshops. ICDAR 2023. Lecture Notes in Computer Science, vol 14193. Springer, Cham. https://doi.org/10.1007/978-3-031-41498-5_16