Effective Web data extraction with standard XML technologies
Introduction
Given the rapid growth and success of public information sources on the World Wide Web, it is increasingly attractive to extract data from these sources and make it available for further processing by end users and application programs. Data extracted from Web sites can serve as the springboard for a variety of tasks, including information retrieval (e.g., business intelligence), event monitoring (news and stock market), and electronic commerce (shopping comparison).
Extracting structured data from Web sites is not a trivial task. Most of the information on the Web today is in the form of Hypertext Markup Language (HTML) documents, which are viewed by humans with a browser. HTML documents are sometimes written by hand, sometimes with the aid of HTML tools. Given that the format of HTML documents is designed for presentation purposes, not automated extraction, and that most of the HTML content on the Web is ill-formed (“broken”), extracting data from such documents is comparable to extracting structure from unstructured documents.
In the future, some if not most Web content may be available in formats more suitable for automated processing, in particular the Extensible Markup Language (XML) [17]. Despite being a relatively new development, XML has become essential for enabling data interchange between otherwise incompatible systems. However, the volume of XML content available on the Web today is still minuscule compared to that of HTML. It is therefore reasonable (and profitable) to study ways of translating existing HTML content to XML, thereby exposing more Web sites to automated processing by end users and application programs. The tools and techniques that we collectively know as Web data extraction are key to making this possible.
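The first step of any such HTML-to-XML translation is normalizing ill-formed markup into well-formed XML. The following is a minimal sketch of that step using only the Python standard library; the tag handling and void-element list are deliberately simplified and illustrative, and a production system would rely on a dedicated cleaner such as HTML Tidy.

```python
# Normalize ill-formed ("broken") HTML into well-formed XML.
# Simplified sketch: closes dangling tags and self-closes void elements.
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}

class HtmlToXml(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []   # output fragments
        self.stack = []   # currently open (non-void) tags

    def handle_starttag(self, tag, attrs):
        attr_text = "".join(f" {k}={quoteattr(v or '')}" for k, v in attrs)
        if tag in VOID_TAGS:
            self.parts.append(f"<{tag}{attr_text}/>")  # self-close void elements
        else:
            self.parts.append(f"<{tag}{attr_text}>")
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        # Close any unclosed intervening tags so the output stays well-formed.
        while self.stack and self.stack[-1] != tag:
            self.parts.append(f"</{self.stack.pop()}>")
        if self.stack:
            self.parts.append(f"</{self.stack.pop()}>")

    def handle_data(self, data):
        self.parts.append(escape(data))

    def close(self):
        super().close()
        while self.stack:  # close anything still open at end of input
            self.parts.append(f"</{self.stack.pop()}>")

def html_to_xml(broken_html):
    parser = HtmlToXml()
    parser.feed(broken_html)
    parser.close()
    return "".join(parser.parts)

# Ill-formed input: unclosed <p> elements and a bare <br>.
print(html_to_xml("<html><body><p>Price: $9.99<br><p>In stock</body></html>"))
# → <html><body><p>Price: $9.99<br/><p>In stock</p></p></body></html>
```

The resulting document can be parsed by any conforming XML processor, which is what makes the downstream XPath/XSLT-based extraction steps possible.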
In this paper we focus on system-oriented issues in Web data extraction and describe our approach for building a dependable extraction process. Our ideas are manifested in ANDES, a crawler-based Web data extraction framework and the backbone of several Web data extraction systems in production use at IBM.
Related work
Several research groups have focused on the problem of extracting structured data from HTML documents. Much of the research is in the context of a database system, and the focus is on wrappers that translate a database query to a Web request and parse the resulting HTML page. Our focus is on batch-oriented data extraction: crawling target Web sites, extracting structured data, performing domain-specific feature extraction and resolution of missing and conflicting data, and making the data available for further processing.
Extracting structured data from Web sites
Extracting structured data from Web sites requires solving five distinct problems: finding target HTML pages on a site by following hyperlinks (navigation problem), extracting relevant pieces of data from these pages (data extraction problem), distilling the data and improving its degree of structure (structure synthesis problem), ensuring data homogeneity (data mapping problem), and merging data from separate HTML pages (data integration problem). We discuss each problem in the following sections.
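The data extraction step above can be illustrated with standard XML technologies: once a page has been normalized to well-formed XML, structured records can be pulled out with XPath-style queries. The sketch below uses Python's standard-library `xml.etree.ElementTree`; the sample document, element names, and class attributes are invented for illustration, and real pages require site-specific expressions.

```python
# Pull structured records out of a normalized (well-formed) page
# using XPath-style queries over the element tree.
import xml.etree.ElementTree as ET

page = ET.fromstring("""
<html><body>
  <table class="products">
    <tr><td class="name">Widget</td><td class="price">9.99</td></tr>
    <tr><td class="name">Gadget</td><td class="price">24.50</td></tr>
  </table>
</body></html>
""")

records = []
for row in page.findall(".//table[@class='products']/tr"):
    name = row.find("td[@class='name']").text
    price = float(row.find("td[@class='price']").text)
    records.append({"name": name, "price": price})

print(records)
# → [{'name': 'Widget', 'price': 9.99}, {'name': 'Gadget', 'price': 24.5}]
```

Because the queries operate on the document's tree structure rather than its visual layout, they remain stable across cosmetic changes to the page, which is the core advantage of an XML-based pipeline over ad hoc string matching.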
ANDES architecture
In this section we describe some operational features of the ANDES framework. All features described in Section 3 have been implemented in the framework and are in production use within IBM.
Conclusions and future work
In this paper we have discussed the problem of data extraction from Web sites and suggested an XML-based approach for solving it. We view the task of data extraction as a multi-step process where the goal extends far beyond simple “screen scraping”. Our ultimate goal is to be able to extract semistructured data from given Web sites and transform the data into a well-structured, feature-rich representation. Managing the heterogeneity of data retrieved from different Web sites is an integral part of our approach.
Acknowledgements
The author would like to thank Jared Jackson and Stephen Dill of IBM Almaden Research Center, Yan Zhou of IBM China Development Laboratory, and Dorine Yelton, John Rees, and Douglas Griswold of IBM Global Services, for their contributions to the ideas and software presented in this paper.
References (20)
- WIDL: Application integration with XML, World Wide Web Journal (1997)
- N. Ashish, C. Knoblock, Wrapper generation for semi-structured Internet sources, in: Proceedings ACM SIGMOD Workshop on...
- M.L. Barja, T. Bratvold, J. Myllymaki, G. Sonnenberger, Informia: A mediator for integrated access to heterogeneous...
- S. Chawathe, H. Garcia-Molina, J. Hammer, K. Ireland, Y. Papakonstantinou, J. Ullman, J. Widom, The TSIMMIS project:...
- IBM Corporation, DB2 XML Extender...
- BrightPlanet.com, DeepWeb white paper...
- C. Knoblock, S. Minton, J.L. Ambite, N. Ashish, P. Modi, I. Muslea, A. Philpot, S. Tejada, Modeling Web sources for...
- L. Liu, C. Pu, W. Han, XWRAP: an XML-enabled wrapper construction system for Web information sources, in: Proceedings...
- A. Sahuguet, F. Azavant, Building light-weight wrappers for legacy Web data-sources using W4F, in: Proceedings of the...
- Information on the fast track, IBM Research Magazine (1997)
Jussi Myllymaki is a Research Staff Member in the Web Technologies Department at the IBM Almaden Research Center in San Jose, California. He received his M.S. degree in Industrial Management from Helsinki University of Technology, Finland, and his M.S. and Ph.D. degrees in Computer Science from the University of Wisconsin at Madison. Dr. Myllymaki's early work focused on the performance evaluation of tertiary storage devices and database systems. His current work ranges from Web search engine technology and Web data extraction to location-based services and dynamic location data management.