Visual information extraction

Aumann, Yonatan; Feldman, Ronen; Liberzon, Yair; Rosenfeld, Benjamin; Schler, Jonathan

doi:10.1007/s10115-006-0014-x

Regular Paper
Published: 04 April 2006

Volume 10, pages 1–15 (2006)
Knowledge and Information Systems

Yonatan Aumann¹,²,
Ronen Feldman¹,²,
Yair Liberzon²,
Benjamin Rosenfeld² &
Jonathan Schler¹

Abstract

Typographic and visual information is an integral part of textual documents. Most information extraction (IE) systems ignore most of this visual information, processing the text as a linear sequence of words. Thus, much valuable information is lost. In this paper, we show how to make use of this visual information for IE. We present an algorithm that automatically extracts specific fields of a document (such as the title and author) based exclusively on the document's visual formatting, without any reference to its semantic content. The algorithm employs a machine learning approach: the system is first provided with a set of training documents in which the target fields are manually tagged, and from these it automatically learns how to extract the same fields from future documents. We implemented the algorithm in a system for automatic analysis of documents in PDF format. We present experimental results of applying the system to a set of financial documents, extracting nine different target fields. Overall, the system achieved 90% accuracy.
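To make the idea of layout-only extraction concrete, the following is a minimal sketch and not the algorithm of the paper, whose details the abstract does not give. It assumes each document has already been split into text blocks described by a few layout features, and it swaps in a plain nearest-neighbour match against the manually tagged training blocks. The `Block` class, its feature set, the `train` and `extract` helpers, and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class Block:
    """A text block described only by its layout (hypothetical feature set)."""
    text: str          # carried along for output; never used for matching
    x: float           # left edge, as a fraction of page width
    y: float           # top edge, as a fraction of page height
    font_size: float   # in points
    bold: bool

    def features(self) -> tuple:
        # Scale the font size so that no single feature dominates the distance.
        return (self.x, self.y, self.font_size / 72.0, float(self.bold))


def train(tagged_docs):
    """Collect, per field, the layout features of every manually tagged block."""
    exemplars = {}
    for doc in tagged_docs:
        for field, block in doc.items():
            exemplars.setdefault(field, []).append(block.features())
    return exemplars


def extract(exemplars, blocks):
    """For each field, pick the block whose layout is closest to any exemplar."""
    result = {}
    for field, examples in exemplars.items():
        best = min(
            blocks,
            key=lambda b: min(dist(b.features(), e) for e in examples),
        )
        result[field] = best.text
    return result


# Illustrative data only: one tagged training document and one unseen document.
training = [
    {
        "title": Block("Annual Report", 0.30, 0.10, 24, True),
        "author": Block("J. Doe", 0.40, 0.18, 12, False),
    },
]
new_doc = [
    Block("Quarterly Outlook", 0.28, 0.11, 22, True),
    Block("A. Smith", 0.41, 0.19, 12, False),
    Block("Body text goes here.", 0.10, 0.40, 10, False),
]
fields = extract(train(training), new_doc)
print(fields)
```

The block text is never read during matching, which mirrors the abstract's point that the fields are located from formatting alone. A real system would need richer layout features and a way to report that a field is absent from a document.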

References

Proceedings of the seventh message understanding conference (MUC-7) Available at: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/muc_7_toc.html
Proceedings of the third message understanding conference (MUC-3) (1991) Morgan Kaufmann
Proceedings of the fourth message understanding conference (MUC-4) (1992) Morgan Kaufmann
Proceedings of the fifth message understanding conference (MUC-5) (1993) Morgan Kaufmann
Proceedings of the sixth message understanding conference (MUC-6) (1995) Morgan Kaufmann
Altamura O, Esposito F, Malerba D (2001) Transforming paper documents into XML format with WISDOM++. Int J Document Anal Recog 4(1):2–17
Anjewierden A. AIDAS: incremental logical structure discovery in pdf documents. In: Proceedings of the sixth international conference on document analysis and recognition (ICDAR), pp 374–378
Ashish N, Knoblock C (1997) Wrapper generation for semi-structured internet sources. In: Proceedings of the workshop on management of semistructured data, Tucson
Berardi M, Lapi M, Malerba D (2004) An integrated approach for automatic semantic structure extraction in document images. In: Marinai S, Dengel A (eds) Document analysis systems. Lecture Notes in Computer Science, vol 3163. Springer-Verlag, Berlin Heidelberg New York, pp 179–190
Bright L, Gruser JR, Raschid L, Vidal ME (1999) A wrapper generation toolkit to specify and construct wrappers for Web accessible data sources (WebSources). Int J Comput Syst Sci Eng 14(2):83–97
Califf ME, Mooney RJ (1999) Relational learning of pattern-match rules for information extraction. In: AAAI99/IAAI99: Proceedings of the sixteenth national conference on artificial intelligence and the eleventh innovative applications of artificial intelligence conference, pp 328–334
Chao H, Beretta G, Sang H (2001) PDF document layout study with page elements and bounding boxes. In: Workshop on document layout interpretation and its applications (DLIA2001)
Eikvil L (1999) Information extraction from world wide web – a survey. Technical Report 945, Norwegian Computing Center
Esposito F, Malerba D, Lisi FA (2000) Machine learning for intelligent processing of printed documents. J Intell Inform Syst 14(2/3):178–198
Etzioni O, Weld D (1994) A softbot-based interface to the internet. Commun ACM 37(7):72–76
Freitag D (1998) Toward general-purpose learning for information extraction. In: Proceedings of the thirty-sixth annual meeting of the association for computational linguistics and seventeenth international conference on computational linguistics, pp 404–408
Friedman M, Weld DS (1997) Efficiently executing information-gathering plans. In: 15th international joint conference on artificial intelligence, Nagoya, Japan, pp 785–791
Futrelle RP, Shao M, Cieslik C, Grimes AE (2003) Extraction, layout analysis and classification of diagrams in PDF documents. In: Proceedings of the seventh international conference on document analysis and recognition, IEEE, pp 1007–1015
Hammer J, García-Molina H, Nestorov S, Yerneni R, Breunig M, Vassalos V (1997) Template-based wrappers in the TSIMMIS system. In: Proceedings of the twenty-third ACM SIGMOD international conference on management of data, pp 532–535
Hsu CN, Dung MT (1998) Generating finite-state transducers for semi-structured data extraction from the web. Inform Syst 23(8):521–538
Kushmerick N (2000) Wrapper induction: Efficiency and expressiveness. Artif Intell 118(1–2):15–68
Lewis JW (1991) Wrappers: integration utilities and services for the DICE architecture. In: Proceedings of the second national symposium on concurrent engineering, pp 445–457
Lovegrove WS, Brailsford DF (1995) Document analysis of PDF files: methods, results and implications. Electron Publish 8(2/3):207–220
Muslea I, Minton S, Knoblock CA (2001) Hierarchical wrapper induction for semistructured information sources. Autonom Agents Multi-Agent Syst 4(1/2):93–114
Papageorgiou C, Poggio T (2000) A trainable system for object detection. Int J Comput Vis 38(1):15–33
Papakonstantinou Y, Gupta A, García-Molina H, Ullman JD (1995) A query translation scheme for rapid implementation of wrappers. In: 4th international conference on deductive and object-oriented databases, LNCS, vol 1013. Springer, Berlin Heidelberg New York, pp 319–344
Poggio T, Edelman S (1990) Network that learns to recognize 3D objects. Nature 343:263–266
Rosenfeld B, Feldman R, Aumann Y (2002) Structural extraction from visual layout of documents. In: Proceedings of the eleventh international conference on information and knowledge management, pp 203–210
Selberg E, Etzioni O (1997) The MetaCrawler architecture for resource aggregation on the Web. IEEE Expert 12(1):8–14
Soderland S (1999) Learning information extraction rules for semi-structured and free text. Mach Learn 34(1–3):233–272

Author information

Authors and Affiliations

Department of Computer Science, Bar Ilan University, Ramat Gan, 52900, Israel
Yonatan Aumann, Ronen Feldman & Jonathan Schler
ClearForest Ltd., 6 Yoni Netanyahu Street, Yehuda, 60376, Israel
Yonatan Aumann, Ronen Feldman, Yair Liberzon & Benjamin Rosenfeld

Corresponding author

Correspondence to Ronen Feldman.

Additional information

A preliminary description of this work appeared in [28].

About this article

Cite this article

Aumann, Y., Feldman, R., Liberzon, Y. et al. Visual information extraction. Knowl Inf Syst 10, 1–15 (2006). https://doi.org/10.1007/s10115-006-0014-x

Received: 01 March 2005
Revised: 15 June 2005
Accepted: 05 July 2005
Published: 04 April 2006
Issue Date: July 2006
DOI: https://doi.org/10.1007/s10115-006-0014-x
