Abstract
Although digital document processing is increasingly widespread in industry, printed documents are still widely used. To process the contents of printed documents electronically, information must be extracted from digital images of those documents. In complex documents, the contents of different regions and fields can be highly heterogeneous with respect to layout, printing quality, and the use of fonts and typing standards, which makes reconstructing the contents from digital images a difficult problem. In this article we present an efficient solution to this problem, in which the semantic contents of fields in a complex document are extracted from a digital image.

Notes
In these equations, P.x and P.y stand for the horizontal and the vertical coordinates of pixel P, respectively.
Tesseract is an efficient OCR engine originally developed at Hewlett-Packard between 1984 and 1994. In 1995 it was ranked as one of the top three OCR engines in the UNLV Accuracy Test. Tesseract was open-sourced in late 2005.
Acknowledgements
This work has been partially supported by Opus Software. We thank the anonymous reviewers for their many comments and suggestions on preliminary versions of this work, which greatly helped improve the quality of the final version of this article.
Cite this article
Peanho, C.A., Stagni, H. & da Silva, F.S.C. Semantic information extraction from images of complex documents. Appl Intell 37, 543–557 (2012). https://doi.org/10.1007/s10489-012-0348-x
DOI: https://doi.org/10.1007/s10489-012-0348-x