Abstract
In this paper, we present a generator of semi-structured documents (SSDs). The generator provides samples of administrative documents that are useful for training information extraction systems. It also handles document annotation, which is generally difficult and time-consuming to do by hand. We propose a general structure for SSDs and show that it works on three SSD types: invoices, payslips and receipts. These documents are similar enough to share a common global model, with particularities specific to each type. Both the content and the layout are governed by random variables, so that they can be varied to obtain different samples. The generator outputs the documents in three formats: PDF, XML and TIFF image. We add an evaluation step to select an adequate dataset for the learning process and to avoid overfitting. The current implementation (https://github.com/fairandsmart/facogen) can easily be extended to other SSD types. We use the generator's output to experiment with an information extraction system for SSDs.
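The idea of driving both content and layout by random variables, and of obtaining annotations as a by-product of generation, can be illustrated with a minimal sketch. This is not the authors' implementation: the field names, value pools, coordinate ranges and the XML schema below are all hypothetical, and only an XML annotation is produced (PDF and TIFF rendering are left out).

```python
import random
import xml.etree.ElementTree as ET

# Hypothetical value pools; prices are in cents to keep the arithmetic exact.
COMPANIES = ["Acme SARL", "Globex SA", "Initech SAS"]
PRODUCTS = [("Paper ream", 450), ("Toner", 5990), ("Stapler", 1200)]


def money(cents):
    """Format an integer amount of cents as a decimal string."""
    return f"{cents / 100:.2f}"


def generate_invoice(rng):
    """Draw one invoice; every field carries its label and position."""
    # Layout random variables (assumed ranges, in points).
    header_x = rng.choice([40, 300])       # header on the left or on the right
    table_y = rng.randint(250, 350)        # vertical start of the item table
    row_height = rng.choice([16, 20, 24])  # spacing between table rows

    # Content random variables.
    items = rng.sample(PRODUCTS, k=rng.randint(1, len(PRODUCTS)))
    fields = [
        {"label": "seller_name", "text": rng.choice(COMPANIES),
         "x": header_x, "y": 60},
        {"label": "invoice_number", "text": f"INV-{rng.randint(1000, 9999)}",
         "x": header_x, "y": 80},
    ]
    total = 0
    for i, (name, price) in enumerate(items):
        qty = rng.randint(1, 5)
        total += qty * price
        y = table_y + i * row_height
        fields.append({"label": "product", "text": f"{qty} x {name}",
                       "x": 40, "y": y})
        fields.append({"label": "line_total", "text": money(qty * price),
                       "x": 450, "y": y})
    fields.append({"label": "total", "text": money(total),
                   "x": 450, "y": table_y + len(items) * row_height + 20})
    return fields


def to_xml(fields):
    """Serialise the annotated fields; the ground truth needs no manual labelling."""
    root = ET.Element("document", type="invoice")
    for f in fields:
        el = ET.SubElement(root, "field", label=f["label"],
                           x=str(f["x"]), y=str(f["y"]))
        el.text = f["text"]
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(to_xml(generate_invoice(random.Random(42))))
```

Passing a seeded `random.Random` instance makes each sample reproducible, while changing the seed varies both what the document says and where each field is placed.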
Supported by BPI DeepTech.
Acknowledgements
This work was carried out within the framework of the BPI DeepTech project, a partnership between the University of Lorraine (Ref. UL: GECO/2020/00331), the CNRS, INRIA Lorraine and the company FAIR&SMART. The authors would like to thank all the partners for their fruitful collaboration.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Belhadj, D., Belaïd, Y., Belaïd, A. (2021). Automatic Generation of Semi-structured Documents. In: Barney Smith, E.H., Pal, U. (eds) Document Analysis and Recognition – ICDAR 2021 Workshops. ICDAR 2021. Lecture Notes in Computer Science, vol 12917. Springer, Cham. https://doi.org/10.1007/978-3-030-86159-9_13
DOI: https://doi.org/10.1007/978-3-030-86159-9_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86158-2
Online ISBN: 978-3-030-86159-9
eBook Packages: Computer Science, Computer Science (R0)