Semantic information extraction from images of complex documents

Peanho, Claudio Antonio; Stagni, Henrique; da Silva, Flavio Soares Correa

doi:10.1007/s10489-012-0348-x

Published: 25 April 2012

Volume 37, pages 543–557 (2012)

Applied Intelligence

Claudio Antonio Peanho¹,
Henrique Stagni¹ &
Flavio Soares Correa da Silva²

Abstract

Even though the digital processing of documents is increasingly widespread in industry, printed documents are still widely used. To process the contents of printed documents electronically, information must be extracted from digital images of those documents. In complex documents, the contents of different regions and fields can be highly heterogeneous with respect to layout, printing quality, and the use of fonts and typing standards, which makes reconstructing the contents from digital images a difficult problem. In this article we present an efficient solution to this problem, in which the semantic contents of fields in a complex document are extracted from a digital image.

Notes

In these equations, P.x and P.y stand for the horizontal and the vertical coordinates of pixel P, respectively.
Tesseract is an efficient OCR engine originally developed at Hewlett-Packard between 1984 and 1994. In 1995 it was ranked as one of the top three OCR engines in the UNLV Accuracy Test. Tesseract was open-sourced in late 2005.

Acknowledgements

This work has been partially supported by Opus Software. We thank the anonymous reviewers for their many comments and suggestions on preliminary versions of this work, which greatly helped improve the quality of the final version of this article.

Author information

Authors and Affiliations

Opus Software Ltd, Rua Eugenio de Medeiros 242, Sao Paulo, Brazil
Claudio Antonio Peanho & Henrique Stagni
University of Sao Paulo, Rua do Matao 1010, Sao Paulo, Brazil
Flavio Soares Correa da Silva

Corresponding author

Correspondence to Flavio Soares Correa da Silva.

About this article

Cite this article

Peanho, C.A., Stagni, H. & da Silva, F.S.C. Semantic information extraction from images of complex documents. Appl Intell 37, 543–557 (2012). https://doi.org/10.1007/s10489-012-0348-x

Published: 25 April 2012
Issue Date: December 2012
DOI: https://doi.org/10.1007/s10489-012-0348-x