Article

Joint optimization of wrapper generation and template detection

Authors:
Shuyi Zheng (Pennsylvania State University),
Ruihua Song (Microsoft Research Asia),
Ji-Rong Wen (Microsoft Research Asia),
Di Wu (Chinese University of Hong Kong)

KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, August 2007, Pages 894–902
https://doi.org/10.1145/1281192.1281287

Published: 12 August 2007

ABSTRACT

Many websites have large collections of pages generated dynamically from an underlying structured source such as a database. The data of a category are typically encoded into similar pages by a common script or template. In recent years, value-added services such as comparison shopping and vertical search in a specific domain have motivated research on high-accuracy extraction technologies. Almost all previous work assumes that the input pages of a wrapper induction system conform to a common template and that such pages can be easily identified by a common URL schema. However, we observe that it is hard to distinguish different templates using today's dynamic URLs. Moreover, since extraction accuracy depends heavily on how consistent the input pages are, we argue that it is risky to decide whether pages share a common template based solely on URLs. Instead, we propose a new approach that uses similarity between pages to detect templates. Our approach separates pages with notable inner differences and then generates a wrapper for each group. Experimental results show that the proposed approach is feasible and effective for improving extraction accuracy.
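
The idea of detecting templates from page similarity rather than from URLs can be illustrated with a small sketch. This is not the authors' algorithm, which the abstract does not specify. It assumes, purely for illustration, that a page is summarized by its set of root-to-element tag paths, that similarity is the Jaccard coefficient of those sets, and that pages are grouped greedily against a fixed threshold. The names `TagPathCollector`, `tag_paths`, `jaccard`, `group_by_template`, and the threshold value `0.6` are all hypothetical.

```python
from html.parser import HTMLParser


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths seen in one page."""

    # Elements that never have a closing tag (partial list, an assumption).
    VOID = {"br", "hr", "img", "input", "link", "meta"}

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags, outermost first
        self.paths = set()  # e.g. "html/body/div/table"

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the structural fingerprint of a page as a frozenset of tag paths."""
    collector = TagPathCollector()
    collector.feed(html)
    return frozenset(collector.paths)


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def group_by_template(pages, threshold=0.6):
    """Greedy grouping: each page joins the first group whose representative
    (the group's first page) is at least `threshold` similar, else starts
    a new group. Returns lists of page indices."""
    groups = []  # list of (representative_paths, member_indices)
    for index, html in enumerate(pages):
        paths = tag_paths(html)
        for representative, members in groups:
            if jaccard(representative, paths) >= threshold:
                members.append(index)
                break
        else:
            groups.append((paths, [index]))
    return [members for _, members in groups]
```

Calling `group_by_template` on a list of HTML strings returns one list of page indices per detected group, and a wrapper could then be induced separately for each group. Comparing only against a group's first page keeps the sketch short; a real system would likely use a more robust clustering criterion.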

Index Terms

Joint optimization of wrapper generation and template detection
1. Information systems
  1. Information retrieval
  2. Information storage systems


Published in
KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2007
1080 pages
ISBN: 9781595936097
DOI: 10.1145/1281192
General Chair: Pavel Berkhin (Yahoo!, USA)
Program Chairs: Rich Caruana (Cornell University, USA), Xindong Wu (University of Vermont, USA)
Copyright © 2007 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
information extraction
template detection
wrapper
Qualifiers
- Article
Conference

Acceptance Rates
KDD '07 Paper Acceptance Rate: 111 of 573 submissions, 19%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%