Effective Web data extraction with standard XML technologies
Introduction
Given the rapid growth and success of public information sources on the World Wide Web, it is increasingly attractive to extract data from these sources and make it available for further processing by end users and application programs. Data extracted from Web sites can serve as the springboard for a variety of tasks, including information retrieval (e.g., business intelligence), event monitoring (news and stock market), and electronic commerce (shopping comparison).
Extracting structured data from Web sites is not a trivial task. Most of the information on the Web today is in the form of Hypertext Markup Language (HTML) documents, which are viewed by humans with a browser. HTML documents are sometimes written by hand, sometimes with the aid of HTML tools. Given that the format of HTML documents is designed for presentation purposes, not automated extraction, and that most of the HTML content on the Web is ill-formed (“broken”), extracting data from such documents is comparable to extracting structure from unstructured documents.
In the future, some if not most Web content may be available in formats more suitable for automated processing, in particular the Extensible Markup Language (XML) [17]. Despite being a relatively new development, XML has become essential for enabling data interchange between otherwise incompatible systems. However, the volume of XML content available on the Web today is still minuscule compared to that of HTML. It is therefore reasonable (and profitable) to study ways of translating existing HTML content to XML, thereby exposing more Web sites to automated processing by end users and application programs. The tools and techniques that we collectively know as Web data extraction are key to making this possible.
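The first step of any such HTML-to-XML translation is normalizing ill-formed markup into well-formed XML. The following is a minimal sketch of that step using only the Python standard library; the tag handling and void-element list are deliberately simplified and illustrative, and a production system would rely on a dedicated cleaner such as HTML Tidy.

```python
# Normalize ill-formed ("broken") HTML into well-formed XML.
# Simplified sketch: closes dangling tags and self-closes void elements.
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}

class HtmlToXml(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []   # output fragments
        self.stack = []   # currently open (non-void) tags

    def handle_starttag(self, tag, attrs):
        attr_text = "".join(f" {k}={quoteattr(v or '')}" for k, v in attrs)
        if tag in VOID_TAGS:
            self.parts.append(f"<{tag}{attr_text}/>")  # self-close void elements
        else:
            self.parts.append(f"<{tag}{attr_text}>")
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        # Close any unclosed intervening tags so the output stays well-formed.
        while self.stack and self.stack[-1] != tag:
            self.parts.append(f"</{self.stack.pop()}>")
        if self.stack:
            self.parts.append(f"</{self.stack.pop()}>")

    def handle_data(self, data):
        self.parts.append(escape(data))

    def close(self):
        super().close()
        while self.stack:  # close anything still open at end of input
            self.parts.append(f"</{self.stack.pop()}>")

def html_to_xml(broken_html):
    parser = HtmlToXml()
    parser.feed(broken_html)
    parser.close()
    return "".join(parser.parts)

# Ill-formed input: unclosed <p> elements and a bare <br>.
print(html_to_xml("<html><body><p>Price: $9.99<br><p>In stock</body></html>"))
# → <html><body><p>Price: $9.99<br/><p>In stock</p></p></body></html>
```

The resulting document can be parsed by any conforming XML processor, which is what makes the downstream XPath/XSLT-based extraction steps possible.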
In this paper we focus on system-oriented issues in Web data extraction and describe our approach for building a dependable extraction process. Our ideas are manifested in ANDES, a crawler-based Web data extraction framework and the backbone of several Web data extraction systems in production use at IBM.
Related work
Several research groups have focused on the problem of extracting structured data from HTML documents. Much of the research is in the context of a database system, and the focus is on wrappers that translate a database query to a Web request and parse the resulting HTML page. Our focus is on batch-oriented data extraction: crawling target Web sites, extracting structured data, performing domain-specific feature extraction and resolution of missing and conflicting data, and making the data available for further processing.
Extracting structured data from Web sites
Extracting structured data from Web sites requires solving five distinct problems: finding target HTML pages on a site by following hyperlinks (navigation problem), extracting relevant pieces of data from these pages (data extraction problem), distilling the data and improving its degree of structure (structure synthesis problem), ensuring data homogeneity (data mapping problem), and merging data from separate HTML pages (data integration problem). We discuss each problem in the following sections.
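The data extraction step above can be illustrated with standard XML technologies: once a page has been normalized to well-formed XML, structured records can be pulled out with XPath-style queries. The sketch below uses Python's standard-library `xml.etree.ElementTree`; the sample document, element names, and class attributes are invented for illustration, and real pages require site-specific expressions.

```python
# Pull structured records out of a normalized (well-formed) page
# using XPath-style queries over the element tree.
import xml.etree.ElementTree as ET

page = ET.fromstring("""
<html><body>
  <table class="products">
    <tr><td class="name">Widget</td><td class="price">9.99</td></tr>
    <tr><td class="name">Gadget</td><td class="price">24.50</td></tr>
  </table>
</body></html>
""")

records = []
for row in page.findall(".//table[@class='products']/tr"):
    name = row.find("td[@class='name']").text
    price = float(row.find("td[@class='price']").text)
    records.append({"name": name, "price": price})

print(records)
# → [{'name': 'Widget', 'price': 9.99}, {'name': 'Gadget', 'price': 24.5}]
```

Because the queries operate on the document's tree structure rather than its visual layout, they remain stable across cosmetic changes to the page, which is the core advantage of an XML-based pipeline over ad hoc string matching.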
ANDES architecture
In this section we describe some operational features of the ANDES framework. All features described in Section 3 have been implemented in the framework and are in production use within IBM.
Conclusions and future work
In this paper we have discussed the problem of data extraction from Web sites and suggested an XML-based approach for solving it. We view the task of data extraction as a multi-step process where the goal extends far beyond simple “screen scraping”. Our ultimate goal is to be able to extract semistructured data from given Web sites and transform the data into a well-structured, feature-rich representation. Managing the heterogeneity of data retrieved from different Web sites is an integral part of our approach.
Acknowledgements
The author would like to thank Jared Jackson and Stephen Dill of IBM Almaden Research Center, Yan Zhou of IBM China Development Laboratory, and Dorine Yelton, John Rees, and Douglas Griswold of IBM Global Services, for their contributions to the ideas and software presented in this paper.
References (20)
- WIDL: Application integration with XML, World Wide Web Journal (1997)
- N. Ashish, C. Knoblock, Wrapper generation for semi-structured Internet sources, in: Proceedings ACM SIGMOD Workshop on...
- M.L. Barja, T. Bratvold, J. Myllymaki, G. Sonnenberger, Informia: A mediator for integrated access to heterogeneous...
- S. Chawathe, H. Garcia-Molina, J. Hammer, K. Ireland, Y. Papakonstantinou, J. Ullman, J. Widom, The TSIMMIS project:...
- IBM Corporation, DB2 XML Extender...
- BrightPlanet.com, DeepWeb white paper...
- C. Knoblock, S. Minton, J.L. Ambite, N. Ashish, P. Modi, I. Muslea, A. Philpot, S. Tejada, Modeling Web sources for...
- L. Liu, C. Pu, W. Han, XWRAP: an XML-enabled wrapper construction system for Web information sources, in: Proceedings...
- A. Sahuguet, F. Azavant, Building light-weight wrappers for legacy Web data-sources using W4F, in: Proceedings of the...
- Information on the fast track, IBM Research Magazine (1997)
Jussi Myllymaki is a Research Staff Member in the Web Technologies Department at the IBM Almaden Research Center in San Jose, California. He received his M.S. degree in Industrial Management from Helsinki University of Technology, Finland, and his M.S. and Ph.D. degrees in Computer Science from the University of Wisconsin at Madison. Dr. Myllymaki's early work focused on the performance evaluation of tertiary storage devices and database systems. His current work ranges from Web search engine technology and Web data extraction to location-based services and dynamic location data management.