ABSTRACT
Wrapping is the process of navigating a data source, semi-automatically extracting data, and transforming it into a form suitable for data processing applications. A number of established products for wrapping data from web pages are currently on the market. One such approach is Lixto [1], a product of research performed at our institute. Our work is concerned with extending the wrapping functionality of Lixto to PDF documents. As the PDF format is relatively unstructured, this is a challenging task. We have developed a method to segment the page into blocks, which are represented as nodes in a relational graph. This paper describes our current research in the use of relational matching techniques on this graph to locate wrapping instances.
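As a rough illustration of the idea of blocks as nodes in a relational graph, the following minimal sketch builds such a graph from a few hand-written blocks and searches it for a small pattern. Everything in it is an assumption made for illustration: the `Block` fields, the `left_of` and `above` relations, the sample coordinates and texts, and the brute-force search are hypothetical and are not the segmentation or matching method developed in this work.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Block:
    """A segmented page block (hypothetical fields; y grows downwards)."""
    name: str
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def relation(a, b):
    """Return a coarse spatial relation from block a to block b, or None."""
    v_overlap = min(a.y1, b.y1) - max(a.y0, b.y0)
    h_overlap = min(a.x1, b.x1) - max(a.x0, b.x0)
    if v_overlap > 0 and b.x0 >= a.x1:
        return "left_of"   # a and b share a row, a comes first
    if h_overlap > 0 and b.y0 >= a.y1:
        return "above"     # a and b share a column, a comes first
    return None


def build_graph(blocks):
    """Map each ordered pair of block names to its relation, if any."""
    graph = {}
    for a, b in permutations(blocks, 2):
        rel = relation(a, b)
        if rel is not None:
            graph[(a.name, b.name)] = rel
    return graph


def find_matches(pattern_nodes, pattern_edges, blocks, graph):
    """Brute-force search for assignments of pattern nodes to blocks.

    pattern_nodes: {node name: predicate on a Block}
    pattern_edges: {(node name, node name): required relation}
    """
    names = list(pattern_nodes)
    matches = []
    for combo in permutations(blocks, len(names)):
        assign = dict(zip(names, combo))
        if not all(pattern_nodes[n](assign[n]) for n in names):
            continue
        if all(graph.get((assign[p].name, assign[q].name)) == rel
               for (p, q), rel in pattern_edges.items()):
            matches.append({n: assign[n].name for n in names})
    return matches


# Invented sample page: two label/value rows.
blocks = [
    Block("b1", "Invoice No:", 10, 10, 80, 20),
    Block("b2", "12345", 90, 10, 130, 20),
    Block("b3", "Date:", 10, 30, 80, 40),
    Block("b4", "2005-11-04", 90, 30, 150, 40),
]
graph = build_graph(blocks)

# Pattern: a label ending in ":" with a value block to its right.
pattern_nodes = {
    "label": lambda b: b.text.endswith(":"),
    "value": lambda b: not b.text.endswith(":"),
}
pattern_edges = {("label", "value"): "left_of"}

matches = find_matches(pattern_nodes, pattern_edges, blocks, graph)
print(matches)
```

The exhaustive search is only practical for very small patterns and pages; it stands in here for whatever relational matching technique is actually applied to the graph.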
- R. Baumgartner, S. Flesca, and G. Gottlob. Visual web information extraction with Lixto. In The VLDB Journal, pages 119--128, 2001.
- W. J. Christmas, J. Kittler, and M. Petrou. Structural matching in computer vision using probabilistic relaxation. IEEE Tran. on Pattern Anal. and Mach. Intel., 17(8):749--764, Aug. 1995.
- J. Llados, E. Marti, and J. J. Villanueva. Symbol recognition by error-tolerant subgraph matching between region adjacency graphs. IEEE Tran. on Pattern Anal. and Mach. Intel., 23(10):1137--1143, Oct. 2001.
Index Terms
- Using graph matching techniques to wrap data from PDF documents