Computer Networks

Volume 39, Issue 5, 5 August 2002, Pages 635-644

Effective Web data extraction with standard XML technologies

https://doi.org/10.1016/S1389-1286(02)00214-1

Abstract

We describe an Extensible Markup Language (XML)-based methodology for Web data extraction that extends beyond simple “screen scraping”. An ideal data extraction process can digest target Web databases that are visible only as Hypertext Markup Language (HTML) pages, and create a local replica of those databases as a result. What is needed is more than a Web crawler and a set of Web site wrappers. A comprehensive data extraction process must deal with such obstacles as session identifiers, HTML forms, client-side JavaScript, incompatible datasets and vocabularies, and missing and conflicting data. Proper data extraction also requires solid data validation and error recovery to handle data extraction failures. Our ANDES software framework helps solve these problems and provides a platform for building a production-quality Web data extraction process. Key aspects of ANDES are that it uses XML technologies for data extraction, including Extensible HTML (XHTML) and Extensible Stylesheet Language Transformations (XSLT), and that it provides access to the “deep Web”.

Introduction

Given the rapid growth and success of public information sources on the World Wide Web, it is increasingly attractive to extract data from these sources and make it available for further processing by end users and application programs. Data extracted from Web sites can serve as the springboard for a variety of tasks, including information retrieval (e.g., business intelligence), event monitoring (news and stock market), and electronic commerce (shopping comparison).

Extracting structured data from Web sites is not a trivial task. Most of the information on the Web today is in the form of Hypertext Markup Language (HTML) documents, which are meant to be viewed by humans in a browser. HTML documents are sometimes written by hand, sometimes with the aid of HTML tools. Because HTML is designed for presentation, not automated extraction, and because most of the HTML content on the Web is ill-formed (“broken”), extracting data from such documents is comparable to extracting structure from unstructured documents.
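
As a small illustration of why ill-formed HTML is an obstacle, the sketch below (not part of ANDES; it assumes Python with the lxml library as a stand-in for whatever HTML-cleaning step a real pipeline would use) repairs a broken HTML fragment into well-formed, XHTML-style markup that XML tools can then process:

    # Illustrative only: repair ill-formed HTML into well-formed markup.
    # lxml is an assumed stand-in, not the tool used by ANDES.
    from lxml import html, etree

    broken = "<html><body><p>Price: $9.99<br><b>In stock</body>"

    # lxml's forgiving HTML parser closes unclosed tags and adds missing elements.
    doc = html.fromstring(broken)

    # Serialize the repaired tree as well-formed XML so XPath/XSLT tools can consume it.
    print(etree.tostring(doc, pretty_print=True, method="xml").decode())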

In the future, some if not most Web content may be available in formats more suitable for automated processing, in particular the Extensible Markup Language (XML) [17]. Despite being a relatively new development, XML has become absolutely essential for enabling data interchange between otherwise incompatible systems. However, the volume of XML content available on the Web today is still miniscule compared to that of HTML. It is therefore reasonable (and profitable) to study ways of translating existing HTML content to XML, and thereby expose more Web sites to automated processing by end users and application programs. The tools and techniques that we collectively know as Web data extraction are key to making this possible.
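
The sketch below shows the HTML-to-XML translation idea in miniature. It is not the ANDES implementation: it assumes Python with lxml's XSLT 1.0 support, and the stylesheet, element names, and class attributes are invented for illustration.

    # Illustrative only: pull data out of (cleaned-up) HTML with an XSLT
    # stylesheet and emit domain-specific XML. All names here are invented.
    from lxml import etree, html

    page = html.fromstring(
        "<html><body><table>"
        "<tr><td class='name'>Widget</td><td class='price'>9.99</td></tr>"
        "</table></body></html>"
    )

    stylesheet = etree.XML(b"""
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:template match="/">
        <products>
          <xsl:for-each select="//tr">
            <product>
              <name><xsl:value-of select="td[@class='name']"/></name>
              <price><xsl:value-of select="td[@class='price']"/></price>
            </product>
          </xsl:for-each>
        </products>
      </xsl:template>
    </xsl:stylesheet>
    """)

    transform = etree.XSLT(stylesheet)
    print(str(transform(page)))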

In this paper we focus on system-oriented issues in Web data extraction and describe our approach for building a dependable extraction process. Our ideas are manifested in ANDES, a crawler-based Web data extraction framework and the backbone of several Web data extraction systems in production use at IBM.

Section snippets

Related work

Several research groups have focused on the problem of extracting structured data from HTML documents. Much of the research is in the context of a database system, and the focus is on wrappers that translate a database query to a Web request and parse the resulting HTML page. Our focus is on batch-oriented data extraction: crawling target Web sites, extracting structured data, performing domain-specific feature extraction and resolution of missing and conflicting data, and making the data …

Extracting structured data from Web sites

Extracting structured data from Web sites requires solving five distinct problems: finding target HTML pages on a site by following hyperlinks (the navigation problem), extracting relevant pieces of data from those pages (the data extraction problem), distilling the data and improving its degree of structure (the structure synthesis problem), ensuring data homogeneity (the data mapping problem), and merging data from separate HTML pages (the data integration problem). We discuss each problem in the following sections.
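
A schematic sketch of this five-stage decomposition follows. It is not ANDES code: the function names are invented, and every body is a placeholder that a real system would replace with a crawler, page wrappers, and domain-specific logic.

    # Illustrative skeleton of the five problems named above; placeholders only.
    from typing import Iterable

    def navigate(seed_url: str) -> Iterable[str]:
        """Navigation: follow hyperlinks from a seed page to the target HTML pages."""
        yield seed_url  # placeholder: a real crawler would expand links here

    def extract(page_url: str) -> dict:
        """Data extraction: pull the relevant fields out of one HTML page."""
        return {"url": page_url}  # placeholder record

    def synthesize(record: dict) -> dict:
        """Structure synthesis: distill the data and add structure (e.g., split fields)."""
        return record

    def map_vocabulary(record: dict) -> dict:
        """Data mapping: normalize units and vocabularies so records are homogeneous."""
        return record

    def integrate(records: list) -> list:
        """Data integration: merge records for the same entity from separate pages."""
        return records

    def run(seed_url: str) -> list:
        records = [map_vocabulary(synthesize(extract(u))) for u in navigate(seed_url)]
        return integrate(records)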

ANDES architecture

In this section we describe some operational features of the ANDES framework. All features described in Section 3 have been implemented in the framework and are in production use within IBM.
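
As a rough illustration of one “deep Web” obstacle named in the abstract (pages reachable only by submitting an HTML form within a session), the sketch below submits a form while carrying session cookies. It is not part of ANDES; it assumes Python with the requests library, and the URL and form fields are invented.

    # Illustrative only: reach a form-guarded result page while preserving the
    # session identifier the server issues as a cookie. URL and fields are invented.
    import requests

    session = requests.Session()  # carries session-identifier cookies across requests

    # Load the search page first so the server can issue a session cookie.
    session.get("https://example.com/search")

    # Submit the form; the resulting HTML page is invisible to a crawler that
    # only follows static hyperlinks.
    response = session.post(
        "https://example.com/search",
        data={"query": "widgets", "page": "1"},
    )
    print(response.status_code, len(response.text))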

Conclusions and future work

In this paper we have discussed the problem of data extraction from Web sites and suggested an XML-based approach for solving it. We view the task of data extraction as a multi-step process where the goal extends far beyond simple “screen scraping”. Our ultimate goal is to be able to extract semistructured data from given Web sites and transform the data into a well-structured, feature-rich representation. Managing the heterogeneity of data retrieved from different Web sites is an integral part …

Acknowledgements

The author would like to thank Jared Jackson and Stephen Dill of IBM Almaden Research Center, Yan Zhou of IBM China Development Laboratory, and Dorine Yelton, John Rees, and Douglas Griswold of IBM Global Services, for their contributions to the ideas and software presented in this paper.


References (20)

  • C. Allen, WIDL: Application integration with XML, World Wide Web Journal (1997)
  • N. Ashish, C. Knoblock, Wrapper generation for semi-structured Internet sources, in: Proceedings ACM SIGMOD Workshop on...
  • M.L. Barja, T. Bratvold, J. Myllymaki, G. Sonnenberger, Informia: A mediator for integrated access to heterogeneous...
  • S. Chawathe, H. Garcia-Molina, J. Hammer, K. Ireland, Y. Papakonstantinou, J. Ullman, J. Widom, The TSIMMIS project:...
  • IBM Corporation, DB2 XML Extender...
  • BrightPlanet.com, DeepWeb white paper,...
  • C. Knoblock, S. Minton, J.L. Ambite, N. Ashish, P. Modi, I. Muslea, A. Philpot, S. Tejada, Modeling Web sources for...
  • L. Liu, C. Pu, W. Han, XWRAP: an XML-enabled wrapper construction system for Web information sources, in: Proceedings...
  • A. Sahuguet, F. Azavant, Building light-weight wrappers for legacy Web data-sources using W4F, in: Proceedings of the...
  • B. Schechter, Information on the fast track, IBM Research Magazine (1997)

Cited by (43)

  • Collecting data on textiles from the internet using web crawling and web scraping tools

    2021, Forensic Science International
    Citation excerpt:

    Among these computing techniques, a whole category relies on robots that are coded to browse (bot crawler) and collect data (bot scraper) on the internet. These tools are extremely useful for obtaining relevant data in a systematic and automated way, which can help create structured databases [22–24]. The ethics and legality of these procedures have been questioned recently [25].

  • Physician Rating Websites: Do Radiologists Have an Online Presence?

    2015, Journal of the American College of Radiology
    Citation excerpt:

    When rating websites occasionally indicated a specialty other than diagnostic radiology for the selected physicians, we reviewed NCH SAF for the most-frequent service claims submitted by those physicians, to reconcile the inconsistency between the website reported specialty and that self-designated to Medicare. Using a custom-built online “data-scraping” algorithm, similar to that used for other web data–extraction exercises [14], the content of all posted health care payer, facility, physician, and other nonphysician provider reviews was reconstructed, in early 2014, from a single health care ratings website (www.HealthcareReviews.com). That site was chosen because its data format was amenable to automated extraction, and it had no data-mining prohibition in its terms-of-service agreement.

  • Investigation of developers' perceptions in XML schema development using textual and visual tool types

    2014, International Journal of Software Engineering and Knowledge Engineering

Jussi Myllymaki is a Research Staff Member in the Web Technologies Department at the IBM Almaden Research Center in San Jose, California. He received his M.S. degree in Industrial Management from Helsinki University of Technology, Finland, and his M.S. and Ph.D. degrees in Computer Science from the University of Wisconsin at Madison. Dr. Myllymaki's early work focused on the performance evaluation of tertiary storage devices and database systems. His current work ranges from Web search engine technology and Web data extraction to location-based services and dynamic location data management.
