Article

Using graph matching techniques to wrap data from PDF documents

Authors:
Tamir Hassan, Vienna University of Technology, Wien, Austria
Robert Baumgartner, Vienna University of Technology, Wien, Austria

WWW '06: Proceedings of the 15th international conference on World Wide Web, May 2006, Pages 901–902, https://doi.org/10.1145/1135777.1135935

Published: 23 May 2006

ABSTRACT

Wrapping is the process of navigating a data source, semi-automatically extracting data and transforming it into a form suitable for data processing applications. There are currently a number of established products on the market for wrapping data from web pages. One such approach is Lixto [1], a product of research performed at our institute.

Our work is concerned with extending the wrapping functionality of Lixto to PDF documents. As the PDF format is relatively unstructured, this is a challenging task. We have developed a method to segment the page into blocks, which are represented as nodes in a relational graph. This paper describes our current research in the use of relational matching techniques on this graph to locate wrapping instances.
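The idea of representing page blocks as graph nodes and locating wrapping instances by matching can be illustrated with a minimal sketch. This is not the authors' method: the `Block` fields, the `kind` labels, the `gap` threshold and the two spatial relations are all assumptions made for illustration, and the matcher is a plain exact backtracking search rather than the error-tolerant or probabilistic techniques of [2] and [3].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A rectangular page segment (hypothetical fields; y grows downwards)."""
    name: str
    kind: str   # assumed label, e.g. "label" or "value"
    x: float
    y: float
    w: float
    h: float


def relation(a, b, gap=20.0):
    """Return 'right' or 'below' if b lies next to a in that direction, else None."""
    v_overlap = min(a.y + a.h, b.y + b.h) - max(a.y, b.y)
    h_overlap = min(a.x + a.w, b.x + b.w) - max(a.x, b.x)
    if v_overlap > 0 and 0 <= b.x - (a.x + a.w) <= gap:
        return "right"
    if h_overlap > 0 and 0 <= b.y - (a.y + a.h) <= gap:
        return "below"
    return None


def build_graph(blocks):
    """Build directed edges {(from_name, to_name): relation} between adjacent blocks."""
    edges = {}
    for a in blocks:
        for b in blocks:
            if a is not b:
                rel = relation(a, b)
                if rel:
                    edges[(a.name, b.name)] = rel
    return edges


def find_instances(pattern_nodes, pattern_edges, blocks, edges):
    """Exact backtracking match of a small pattern graph against the page graph.

    pattern_nodes maps a pattern variable to a required block kind;
    pattern_edges maps (variable, variable) to a required relation.
    """
    variables = list(pattern_nodes)
    results = []

    def extend(assign):
        if len(assign) == len(variables):
            results.append(dict(assign))
            return
        var = variables[len(assign)]
        for blk in blocks:
            if blk.kind != pattern_nodes[var] or blk.name in assign.values():
                continue
            assign[var] = blk.name
            # Check every pattern edge whose endpoints are both assigned.
            consistent = all(
                edges.get((assign[u], assign[v])) == rel
                for (u, v), rel in pattern_edges.items()
                if u in assign and v in assign
            )
            if consistent:
                extend(assign)
            del assign[var]

    extend({})
    return results


# Two rows of a made-up form: a label with a value to its right.
blocks = [
    Block("l1", "label", 0, 0, 50, 10),
    Block("v1", "value", 60, 0, 50, 10),
    Block("l2", "label", 0, 20, 50, 10),
    Block("v2", "value", 60, 20, 50, 10),
]
graph = build_graph(blocks)
instances = find_instances(
    {"L": "label", "V": "value"}, {("L", "V"): "right"}, blocks, graph
)
print(instances)
```

Each returned dictionary binds the pattern variables to one label/value pair on the page. An exact matcher like this fails as soon as a block is missing or split, which is presumably why error-tolerant matching is of interest for real PDF layouts.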

References

[1] R. Baumgartner, S. Flesca, and G. Gottlob. Visual web information extraction with Lixto. In The VLDB Journal, pages 119–128, 2001.
[2] W. J. Christmas, J. Kittler, and M. Petrou. Structural matching in computer vision using probabilistic relaxation. IEEE Trans. on Pattern Anal. and Mach. Intel., 17(8):749–764, Aug. 1995.
[3] J. Llados, E. Marti, and J. J. Villanueva. Symbol recognition by error-tolerant subgraph matching between region adjacency graphs. IEEE Trans. on Pattern Anal. and Mach. Intel., 23(10):1137–1143, Oct. 2001.

Index Terms

1. Applied computing
  1. Document management and text processing
    1. Document capture
      1. Document analysis
2. Information systems
  1. Information retrieval

Published in
WWW '06: Proceedings of the 15th international conference on World Wide Web
May 2006
1102 pages
ISBN: 1595933239
DOI: 10.1145/1135777
General Chairs: Leslie Carr (University of Southampton), David De Roure (University of Southampton), Arun Iyengar (IBM Research)
Program Chairs: Carole Goble (University of Manchester, UK), Mike Dahlin (University of Texas at Austin)
Copyright © 2006 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
PDF
document understanding
graph matching
logical structure
wrapping

Acceptance Rates
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%