Abstract
Understanding document images (e.g., invoices) is a core but challenging task, since it requires complex functions such as reading text and holistically understanding the document. Current Visual Document Understanding (VDU) methods outsource the task of reading text to off-the-shelf Optical Character Recognition (OCR) engines and focus on the understanding task over the OCR outputs. Although such OCR-based approaches have shown promising performance, they suffer from 1) the high computational cost of running OCR; 2) the inflexibility of OCR models with respect to languages or document types; 3) the propagation of OCR errors to the subsequent process. To address these issues, in this paper we introduce a novel OCR-free VDU model named Donut, which stands for Document understanding transformer. As a first step in OCR-free VDU research, we propose a simple architecture (i.e., a Transformer) with a simple pre-training objective (i.e., cross-entropy loss). Donut is conceptually simple yet effective. Through extensive experiments and analyses, we show that this simple OCR-free VDU model achieves state-of-the-art performance on various VDU tasks in terms of both speed and accuracy. In addition, we offer a synthetic data generator that makes the model's pre-training flexible across various languages and domains. The code, trained model, and synthetic data are available at https://github.com/clovaai/donut.
T. Hong, M. Yim, J. Park, J. Yim and W. Hwang—This work was done while the authors were at NAVER CLOVA.
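To make the OCR-free pipeline concrete, below is a minimal inference sketch. It is written against the Hugging Face transformers integration of the released Donut checkpoints, which is an assumption; the official repository linked above also ships its own Python API. The checkpoint name targets receipt parsing, and the input file receipt.png is illustrative.

```python
import re

import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

# A checkpoint fine-tuned for receipt parsing (CORD); name assumed from the model hub.
ckpt = "naver-clova-ix/donut-base-finetuned-cord-v2"
processor = DonutProcessor.from_pretrained(ckpt)
model = VisionEncoderDecoderModel.from_pretrained(ckpt)

# The visual encoder consumes raw pixels: no OCR engine is run at any point.
image = Image.open("receipt.png").convert("RGB")  # illustrative input path
pixel_values = processor(image, return_tensors="pt").pixel_values

# The decoder is prompted with a task token and generates the structured output
# autoregressively; training uses plain cross-entropy on these output tokens.
task_prompt = "<s_cord-v2>"
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

with torch.no_grad():
    outputs = model.generate(
        pixel_values,
        decoder_input_ids=decoder_input_ids,
        max_length=model.decoder.config.max_position_embeddings,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
        use_cache=True,
    )

# Strip special tokens and the leading task token, then convert to JSON fields.
sequence = processor.batch_decode(outputs)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(
    processor.tokenizer.pad_token, ""
)
sequence = re.sub(r"<.*?>", "", sequence, count=1)
print(processor.token2json(sequence))
```

Because reading and understanding are learned end to end by a single generative model, the same generate-and-parse loop covers document classification, information extraction, and document VQA; only the task prompt and the output vocabulary change.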
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Kim, G. et al. (2022). OCR-Free Document Understanding Transformer. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13688. Springer, Cham. https://doi.org/10.1007/978-3-031-19815-1_29
DOI: https://doi.org/10.1007/978-3-031-19815-1_29
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19814-4
Online ISBN: 978-3-031-19815-1
eBook Packages: Computer Science, Computer Science (R0)