Data-Efficient Information Extraction from Documents with Pre-trained Language Models

Sage, Clément; Douzon, Thibault; Aussem, Alex; Eglin, Véronique; Elghazel, Haytham; Duffner, Stefan; Garcia, Christophe; Espinas, Jérémy

doi:10.1007/978-3-030-86159-9_33

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12917)

Included in the following conference series:

International Conference on Document Analysis and Recognition

Abstract

As with many text understanding and generation tasks, pre-trained language models have emerged as a powerful approach for extracting information from business documents. However, their performance has not been properly studied in data-constrained settings, which are often encountered in industrial applications. In this paper, we show that LayoutLM, a pre-trained model recently proposed for encoding 2D documents, exhibits high sample efficiency when fine-tuned on public and real-world Information Extraction (IE) datasets. Indeed, LayoutLM reaches more than 80% of its full performance with as few as 32 documents for fine-tuning. Compared with a strong baseline that learns IE from scratch, the pre-trained model needs between 4 and 30 times fewer annotated documents in the toughest data conditions. Finally, LayoutLM performs better on the real-world dataset when it has first been fine-tuned on the full public dataset, indicating valuable knowledge transfer abilities. We therefore advocate the use of pre-trained language models for tackling practical extraction problems.
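
The two quantities quoted in the abstract (the share of full performance reached with few documents, and the factor by which annotation needs shrink) can be read off a pair of learning curves. The following is a minimal sketch of one plausible way to compute them; the abstract does not give the paper's exact procedure, so the function names and the definition of the reduction factor are assumptions, and the scores are placeholder values chosen for illustration only, not results from the paper.

```python
from typing import Dict, Optional

# A learning curve maps "number of annotated training documents" to a score
# (e.g. F1) measured on a fixed test set.
Curve = Dict[int, float]


def relative_performance(curve: Curve) -> Dict[int, float]:
    """Express each score as a fraction of the score at the largest training size."""
    full_score = curve[max(curve)]
    return {n_docs: score / full_score for n_docs, score in curve.items()}


def docs_needed(curve: Curve, target: float) -> Optional[int]:
    """Smallest training-set size whose score reaches `target`, or None if never reached."""
    reached = [n_docs for n_docs, score in curve.items() if score >= target]
    return min(reached) if reached else None


def reduction_factor(pretrained: Curve, scratch: Curve, target: float) -> Optional[float]:
    """How many times fewer documents the pre-trained model needs to hit `target`.

    Assumed definition: ratio of the smallest training sizes at which each
    model reaches the same target score.
    """
    n_pretrained = docs_needed(pretrained, target)
    n_scratch = docs_needed(scratch, target)
    if n_pretrained is None or n_scratch is None:
        return None
    return n_scratch / n_pretrained


# Placeholder curves for illustration only (NOT numbers from the paper).
pretrained_curve: Curve = {8: 0.50, 32: 0.75, 128: 0.85, 512: 0.90}
scratch_curve: Curve = {8: 0.10, 32: 0.30, 128: 0.60, 512: 0.80}

if __name__ == "__main__":
    print(relative_performance(pretrained_curve)[32])
    print(reduction_factor(pretrained_curve, scratch_curve, target=0.75))
```

With these placeholder curves, the pre-trained model reaches the 0.75 target at 32 documents while the from-scratch model needs 512, which is the kind of comparison the "4 to 30 times fewer" claim summarizes.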

Notes

1. https://github.com/microsoft/unilm/tree/master/layoutlm
2. The metric values are obtained at: https://rrc.cvc.uab.es/?ch=13&com=evaluation&task=3
3. https://github.com/clemsage/unilm

Acknowledgment

The work presented in this paper was supported by Esker. We thank them for providing the PO-51k dataset and for insightful discussions about this research.

Author information

Authors and Affiliations

Univ Lyon, CNRS, LIRIS, Villeurbanne, France
Clément Sage, Thibault Douzon, Alex Aussem, Véronique Eglin, Haytham Elghazel, Stefan Duffner & Christophe Garcia
Esker, Villeurbanne, France
Clément Sage, Thibault Douzon & Jérémy Espinas

Corresponding author

Correspondence to Clément Sage.

Editor information

Editors and Affiliations

Boise State University, Boise, ID, USA
Elisa H. Barney Smith
Indian Statistical Institute, Kolkata, India
Umapada Pal

Cite this paper

Sage, C. et al. (2021). Data-Efficient Information Extraction from Documents with Pre-trained Language Models. In: Barney Smith, E.H., Pal, U. (eds) Document Analysis and Recognition – ICDAR 2021 Workshops. ICDAR 2021. Lecture Notes in Computer Science, vol 12917. Springer, Cham. https://doi.org/10.1007/978-3-030-86159-9_33

DOI: https://doi.org/10.1007/978-3-030-86159-9_33
Published: 02 September 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86158-2
Online ISBN: 978-3-030-86159-9
eBook Packages: Computer Science, Computer Science (R0)
