Abstract
This paper presents the results of the ICDAR 2023 competition on Document UnderstanDing of Everything (DUDE). DUDE introduces a new dataset comprising 5K visually-rich documents (VRDs) with 40K questions, with novelties related to types of questions, answers, and document layouts based on multi-industry, multi-domain, and multi-page VRDs of various origins and dates. The competition was structured as a single task with a multi-phased evaluation protocol that assesses the few-shot capabilities of models by testing generalization to previously unseen questions and domains, a condition essential to business use cases prevailing in the field. A new and independent diagnostic test set was additionally constructed for fine-grained performance analysis. A thorough analysis of results from different participant methods is presented. Under the newly studied settings, current state-of-the-art models show a significant performance gap, even when improving visual evidence and handling multi-page documents. We conclude that the DUDE dataset proposed in this competition will be an essential, long-standing benchmark for further exploring improved generalization and adaptation under low-resource fine-tuning, as desired in the real world.
Acknowledgment
Jordy Van Landeghem acknowledges the financial support of VLAIO (Flemish Innovation & Entrepreneurship) through the Baekeland Ph.D. mandate (HBC.2019.2604). The Smart Growth Operational Programme partially supported this research under projects no. POIR.01.01.01-00-1624/20 (Hiper-OCR - an innovative solution for information extraction from scanned documents) and POIR.01.01.01-00-0605/19 (Disruptive adoption of Neural Language Modelling for automation of text-intensive work).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Van Landeghem, J. et al. (2023). ICDAR 2023 Competition on Document UnderstanDing of Everything (DUDE). In: Fink, G.A., Jain, R., Kise, K., Zanibbi, R. (eds) Document Analysis and Recognition - ICDAR 2023. ICDAR 2023. Lecture Notes in Computer Science, vol 14188. Springer, Cham. https://doi.org/10.1007/978-3-031-41679-8_24
DOI: https://doi.org/10.1007/978-3-031-41679-8_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-41678-1
Online ISBN: 978-3-031-41679-8
eBook Packages: Computer Science (R0)