Abstract
The purpose of this research is to reverse-engineer the process of encoding data in structured documents and then automate its extraction. We assume a broad category of structured documents that goes beyond form processing: the documents may have flexible layouts and may consist of a varying number of pages. The data extraction method (DataX) employs general templates generated by the Inductive Template Generator (InTeGen). The InTeGen method uses inductive learning from examples of documents with identified data elements. Both methods achieve a high degree of automation with minimal user input.
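The learn-then-extract idea in the abstract can be illustrated with a toy sketch. This is not the InTeGen or DataX algorithm, which the abstract does not specify. It assumes, purely for illustration, that a document is a flat list of word tokens, that each training example marks the token index of every data element, and that a "template" is just the anchor word that precedes a field's value in every example. The names `induce_template` and `extract`, and the sample invoice text, are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical representation: a document is a flat list of word tokens, and a
# training example pairs it with {field_name: index of the value token}.
Example = Tuple[List[str], Dict[str, int]]


def induce_template(examples: List[Example]) -> Dict[str, str]:
    """Keep, per field, the word that precedes the value in every example."""
    candidates: Dict[str, Set[str]] = {}
    for tokens, labels in examples:
        for field, idx in labels.items():
            anchor = {tokens[idx - 1].lower()} if idx > 0 else set()
            # Intersect across examples so only consistent anchors survive.
            if field in candidates:
                candidates[field] &= anchor
            else:
                candidates[field] = anchor
    return {f: next(iter(a)) for f, a in candidates.items() if len(a) == 1}


def extract(template: Dict[str, str], tokens: List[str]) -> Dict[str, str]:
    """Return the token that follows each field's anchor word, if present."""
    lowered = [t.lower() for t in tokens]
    result: Dict[str, str] = {}
    for field, anchor in template.items():
        if anchor in lowered:
            pos = lowered.index(anchor)
            if pos + 1 < len(tokens):
                result[field] = tokens[pos + 1]
    return result


# Made-up training documents with different layouts but the same anchors.
examples: List[Example] = [
    ("Invoice No: 1001 Date: 2001-05-02 Total: 45.00".split(),
     {"number": 2, "total": 6}),
    ("ACME Corp Total: 12.50 Invoice No: 77".split(),
     {"number": 6, "total": 3}),
]
template = induce_template(examples)

# A new, unseen layout: the fields are found through their learned anchors.
new_doc = "Date: 2002-01-01 Invoice No: 555 Shipping 3.00 Total: 99.10".split()
fields = extract(template, new_doc)
```

The sketch locates values by their context rather than by fixed page coordinates, which is one simple way to tolerate the flexible layouts the abstract mentions; a realistic system would also need geometric and multi-page information that this toy ignores.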
Keywords
- Principal Component Analysis
- Relevant Document
- Document Image
- Document Representation
- Original Feature Space
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wnek, J. (2002). Machine Learning of Generalized Document Templates for Data Extraction. In: Lopresti, D., Hu, J., Kashi, R. (eds) Document Analysis Systems V. DAS 2002. Lecture Notes in Computer Science, vol 2423. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45869-7_48
DOI: https://doi.org/10.1007/3-540-45869-7_48
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44068-0
Online ISBN: 978-3-540-45869-2
eBook Packages: Springer Book Archive