Semantic information extraction from images of complex documents

Applied Intelligence

Abstract

Although digital document processing is increasingly widespread in industry, printed documents remain in common use. To process the contents of printed documents electronically, information must be extracted from digital images of those documents. In complex documents, the contents of different regions and fields can be highly heterogeneous with respect to layout, printing quality, and the use of fonts and typing standards, which makes reconstructing document contents from digital images a difficult problem. In this article we present an efficient solution to this problem, in which the semantic contents of fields in a complex document are extracted from a digital image.

Notes

  1. In these equations, P.x and P.y stand for the horizontal and the vertical coordinates of pixel P, respectively.

  2. Tesseract is an efficient OCR engine originally developed at Hewlett-Packard between 1984 and 1994. In 1995 it was ranked as one of the top three OCR engines in the UNLV Accuracy Test. Tesseract was open-sourced in late 2005.

Acknowledgements

This work has been partially supported by Opus Software. We thank the anonymous reviewers for their many comments and suggestions on preliminary versions of this work, which greatly helped improve the quality of the final version of this article.

Author information

Corresponding author

Correspondence to Flavio Soares Correa da Silva.

About this article

Cite this article

Peanho, C.A., Stagni, H. & da Silva, F.S.C. Semantic information extraction from images of complex documents. Appl Intell 37, 543–557 (2012). https://doi.org/10.1007/s10489-012-0348-x

  • DOI: https://doi.org/10.1007/s10489-012-0348-x
