
On Text Localization in End-to-End OCR-Free Document Understanding Transformer Without Text Localization Supervision

  • Conference paper
Document Analysis and Recognition – ICDAR 2023 Workshops (ICDAR 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14193)


Abstract

This paper presents a simple yet effective approach for weakly supervised text localization in end-to-end visual document understanding (VDU) models. The traditional approach in VDU is to use off-the-shelf OCR engines in conjunction with natural language understanding models. However, to simplify the VDU pipeline and improve efficiency, recent research has focused on OCR-free document understanding transformers. These models have a limitation in that they do not report the location of the text to the user. To alleviate this limitation, we propose a method for text localization in OCR-free models that requires no additional supervision, such as bounding-box annotations. The method is based on properties of the attention mechanism and outputs text regions with accuracy competitive with supervised methods. The proposed method can be easily applied to most existing OCR-free models, making it an attractive solution for practitioners in the field. We validate the method through experiments on document parsing benchmarks, and the results demonstrate its effectiveness in generalizing to various camera-captured document images, such as receipts and business cards. The implementation will be available at https://github.com/clovaai/donut.
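To make the core idea concrete, the following is a minimal sketch of attention-based localization: the decoder's cross-attention over the encoder's patch grid is aggregated into a saliency map, upsampled, thresholded, and converted to boxes via connected components. The tensor layout, the `boxes_from_cross_attention` helper, and the fixed threshold are illustrative assumptions, not the authors' exact pipeline.

```python
# Minimal sketch: deriving text boxes from decoder cross-attention in an
# OCR-free encoder-decoder model. Hypothetical tensor layout; not the
# authors' implementation.
import numpy as np
import cv2  # OpenCV, used for upsampling and connected-component labeling


def boxes_from_cross_attention(cross_attn, grid_hw, image_hw, thresh=0.5):
    """cross_attn: array of shape (layers, heads, tokens, patches) holding
    the decoder's attention over encoder patches for the tokens of one
    predicted text field. grid_hw: (H, W) of the encoder feature grid.
    image_hw: (H, W) of the input image."""
    h, w = grid_hw
    # Aggregate over layers, heads, and tokens into a single saliency map.
    sal = cross_attn.mean(axis=(0, 1, 2)).reshape(h, w).astype(np.float32)
    sal = (sal - sal.min()) / (sal.max() - sal.min() + 1e-8)  # scale to [0, 1]
    # Upsample to the input-image resolution; cv2.resize takes (width, height).
    sal = cv2.resize(sal, (image_hw[1], image_hw[0]),
                     interpolation=cv2.INTER_LINEAR)
    mask = (sal >= thresh).astype(np.uint8)
    # Each connected component of the binarized map yields one axis-aligned box.
    num_labels, labels = cv2.connectedComponents(mask)
    boxes = []
    for k in range(1, num_labels):  # label 0 is the background
        ys, xs = np.nonzero(labels == k)
        boxes.append((int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())))
    return boxes
```

In practice, the attention weights would be read out of the model's decoder during generation, and the aggregation scheme and threshold would need tuning per model and per benchmark.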

G. Kim and S. Yokoo—Equal contribution.


Notes

  1. https://clova.ai/ocr.

  2. https://github.com/clovaai/donut.

  3. https://clova.ai/ocr.


Author information


Corresponding author

Correspondence to Geewook Kim.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Kim, G., Yokoo, S., Seo, S., Osanai, A., Okamoto, Y., Baek, Y. (2023). On Text Localization in End-to-End OCR-Free Document Understanding Transformer Without Text Localization Supervision. In: Coustaty, M., Fornés, A. (eds) Document Analysis and Recognition – ICDAR 2023 Workshops. ICDAR 2023. Lecture Notes in Computer Science, vol 14193. Springer, Cham. https://doi.org/10.1007/978-3-031-41498-5_16


  • DOI: https://doi.org/10.1007/978-3-031-41498-5_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-41497-8

  • Online ISBN: 978-3-031-41498-5

  • eBook Packages: Computer Science, Computer Science (R0)
