Automatic Generation of Semi-structured Documents

Belhadj, Djedjiga; Belaïd, Yolande; Belaïd, Abdel

doi:10.1007/978-3-030-86159-9_13

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12917)

Included in the following conference series:

International Conference on Document Analysis and Recognition

Abstract

In this paper, we present a generator of semi-structured documents (SSDs). The generator provides samples of administrative documents that are useful for training information extraction systems. It also handles document annotation, which is generally difficult and time-consuming. We propose a general structure for SSDs and show that it works well on three SSD types: invoices, payslips and receipts. Both the content and the layout are controlled by random variables, so that they can be varied to obtain different samples. These document types are similar enough to share a common global model, with particularities for each type. The generator outputs the documents in three formats: PDF, XML and TIFF image. We add an evaluation step to select an adequate dataset for the learning process and to avoid overfitting. The current implementation (https://github.com/fairandsmart/facogen) can easily be extended to other SSD types. We use the generator's output to experiment with a system for extracting information from SSDs.
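
To illustrate the idea that both content and layout are driven by random variables, here is a minimal Python sketch. It is not the authors' implementation (the linked repository is the reference for that). The field pools, layout variants, function names and XML schema below are all assumptions invented for this example.

```python
import random
import xml.etree.ElementTree as ET

# Hypothetical content pools; the real generator's data sources are not described here.
COMPANIES = ["Acme SARL", "Globex SA", "Initech SAS"]
PRODUCTS = [("Paper A4", 4.50), ("Toner", 59.90), ("Stapler", 12.00)]

# Hypothetical layout variants: where the header and totals blocks are placed.
LAYOUTS = [
    {"header": "top-left", "totals": "bottom-right"},
    {"header": "top-right", "totals": "bottom-right"},
    {"header": "top-left", "totals": "bottom-left"},
]


def sample_invoice(rng: random.Random) -> dict:
    """Draw one invoice; content and layout are sampled independently."""
    lines = []
    for _ in range(rng.randint(1, 3)):
        name, unit_price = rng.choice(PRODUCTS)
        lines.append({"name": name, "qty": rng.randint(1, 5), "unit_price": unit_price})
    total = round(sum(l["qty"] * l["unit_price"] for l in lines), 2)
    return {
        "seller": rng.choice(COMPANIES),
        "number": f"INV-{rng.randint(1000, 9999)}",
        "lines": lines,
        "total": total,
        "layout": rng.choice(LAYOUTS),
    }


def to_annotation_xml(invoice: dict) -> str:
    """Serialise the sampled values as a simple XML ground-truth record."""
    root = ET.Element("document", type="invoice")
    layout = ET.SubElement(root, "layout")
    for block, position in invoice["layout"].items():
        ET.SubElement(layout, "block", name=block, position=position)
    ET.SubElement(root, "field", label="seller").text = invoice["seller"]
    ET.SubElement(root, "field", label="number").text = invoice["number"]
    for line in invoice["lines"]:
        ET.SubElement(
            root,
            "line",
            name=line["name"],
            qty=str(line["qty"]),
            unit_price=f"{line['unit_price']:.2f}",
        )
    ET.SubElement(root, "field", label="total").text = f"{invoice['total']:.2f}"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so a dataset can be regenerated
    print(to_annotation_xml(sample_invoice(rng)))
```

Because the annotation is written from the same sampled values that would be rendered on the page, the labels come for free with each sample. Rendering to PDF or TIFF is left out of this sketch.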

Supported by BPI DeepTech.

Acknowledgements

This work was carried out within the framework of the BPI DeepTech project, as a partnership between the University of Lorraine (Ref. UL: GECO/2020/00331), CNRS, INRIA Lorraine and the company FAIR&SMART. The authors would like to thank all the partners for their fruitful collaboration.

Author information

Authors and Affiliations

Université de Lorraine-LORIA, Campus Scientifique, 54500, Vandoeuvre-lès-Nancy, France
Djedjiga Belhadj, Yolande Belaïd & Abdel Belaïd

Editor information

Editors and Affiliations

Boise State University, Boise, ID, USA
Elisa H. Barney Smith
Indian Statistical Institute, Kolkata, India
Umapada Pal

About this paper

Cite this paper

Belhadj, D., Belaïd, Y., Belaïd, A. (2021). Automatic Generation of Semi-structured Documents. In: Barney Smith, E.H., Pal, U. (eds) Document Analysis and Recognition – ICDAR 2021 Workshops. ICDAR 2021. Lecture Notes in Computer Science, vol 12917. Springer, Cham. https://doi.org/10.1007/978-3-030-86159-9_13

DOI: https://doi.org/10.1007/978-3-030-86159-9_13
Published: 02 September 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86158-2
Online ISBN: 978-3-030-86159-9
eBook Packages: Computer Science, Computer Science (R0)
