ICDAR 2023 Competition on Document UnderstanDing of Everything (DUDE)

Van Landeghem, Jordy; Tito, Rubèn; Borchmann, Łukasz; Pietruszka, Michał; Jurkiewicz, Dawid; Powalski, Rafał; Józiak, Paweł; Biswas, Sanket; Coustaty, Mickaël; Stanisławek, Tomasz

doi:10.1007/978-3-031-41679-8_24

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14188)

Included in the following conference series:

International Conference on Document Analysis and Recognition

Abstract

This paper presents the results of the ICDAR 2023 competition on Document UnderstanDing of Everything (DUDE). DUDE introduces a new dataset comprising 5K visually-rich documents (VRDs) and 40K questions, with novelties in the types of questions, answers, and document layouts, drawn from multi-industry, multi-domain, and multi-page VRDs of various origins and dates. The competition was structured as a single task with a multi-phased evaluation protocol that assesses the few-shot capabilities of models by testing generalization to previously unseen questions and domains, a condition essential to the business use cases prevailing in the field. A new and independent diagnostic test set was additionally constructed for fine-grained performance analysis. A thorough analysis of the results from the different participant methods is presented. Under the newly studied settings, current state-of-the-art models show a significant performance gap, even when improving visual evidence and handling multi-page documents. We conclude that the DUDE dataset proposed in this competition will be an essential, long-standing benchmark for further exploring improved generalization and adaptation under low-resource fine-tuning, as desired in the real world.

Acknowledgment

Jordy Van Landeghem acknowledges the financial support of VLAIO (Flemish Innovation & Entrepreneurship) through the Baekeland Ph.D. mandate (HBC.2019.2604). The Smart Growth Operational Programme partially supported this research under projects no. POIR.01.01.01-00-1624/20 (Hiper-OCR - an innovative solution for information extraction from scanned documents) and POIR.01.01.01-00-0605/19 (Disruptive adoption of Neural Language Modelling for automation of text-intensive work).

Author information

Authors and Affiliations

KU Leuven, Leuven, Belgium
Jordy Van Landeghem
Contract.fit, Brussels, Belgium
Jordy Van Landeghem
Snowflake, Bozeman, USA
Łukasz Borchmann, Michał Pietruszka, Dawid Jurkiewicz, Paweł Józiak & Tomasz Stanisławek
Warsaw University of Technology, Warsaw, Poland
Paweł Józiak & Tomasz Stanisławek
Computer Vision Center, Universitat Autònoma de Barcelona, Barcelona, Spain
Rubèn Tito & Sanket Biswas
Jagiellonian University, Kraków, Poland
Michał Pietruszka
Adam Mickiewicz University, Poznań, Poland
Dawid Jurkiewicz
Instabase, San Francisco, USA
Rafał Powalski
University of La Rochelle, La Rochelle, France
Mickaël Coustaty

Corresponding author

Correspondence to Jordy Van Landeghem.

Editor information

Editors and Affiliations

TU Dortmund University, Dortmund, Germany
Gernot A. Fink
Adobe, College Park, MD, USA
Rajiv Jain
Osaka Metropolitan University, Osaka, Japan
Koichi Kise
Rochester Institute of Technology, Rochester, NY, USA
Richard Zanibbi

About this paper

Cite this paper

Van Landeghem, J. et al. (2023). ICDAR 2023 Competition on Document UnderstanDing of Everything (DUDE). In: Fink, G.A., Jain, R., Kise, K., Zanibbi, R. (eds) Document Analysis and Recognition - ICDAR 2023. ICDAR 2023. Lecture Notes in Computer Science, vol 14188. Springer, Cham. https://doi.org/10.1007/978-3-031-41679-8_24

DOI: https://doi.org/10.1007/978-3-031-41679-8_24
Published: 19 August 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-41678-1
Online ISBN: 978-3-031-41679-8
eBook Packages: Computer Science, Computer Science (R0)
