
Data & Knowledge Engineering

Volume 54, Issue 3, September 2005, Pages 279-299

Clustering Web pages based on their structure

https://doi.org/10.1016/j.datak.2004.11.004

Abstract

Several techniques have been recently proposed to automatically generate Web wrappers, i.e., programs that extract data from HTML pages, and transform them into a more structured format, typically in XML. These techniques automatically induce a wrapper from a set of sample pages that share a common HTML template. An open issue, however, is how to collect suitable classes of sample pages to feed the wrapper inducer. Presently, the pages are chosen manually. In this paper, we tackle the problem of automatically discovering the main classes of pages offered by a site by exploring only a small yet representative portion of it. We propose a model to describe abstract structural features of HTML pages. Based on this model, we have developed an algorithm that accepts the URL of an entry point to a target Web site, visits a limited yet representative number of pages, and produces an accurate clustering of pages based on their structure. We have developed a prototype, which has been used to perform experiments on real-life Web sites.

Introduction

A large number of Web sites contain highly structured regions. The pages contained in these regions are generated automatically, either statically or dynamically, by programs that extract the data from a back-end database and embed them into an HTML template. As a consequence, pages generated by the same program exhibit common structure and layout, while differing in content.

Based on this observation, several researchers have recently proposed techniques that leverage the structural similarities of pages from large Web sites to automatically derive Web wrappers [8], [9], [29], [28], i.e., programs that extract data from HTML pages and transform them into a machine-processable format, typically XML. These techniques take a small set of sample pages that exhibit a common template and generate a wrapper that can be used to extract the data from any page that shares the same structure as the input samples.

Applying automatically generated wrappers on a large scale, i.e., to the structured portion of the Web, could anticipate some of the benefits advocated by the Semantic Web vision, because large amounts of data exposed through HTML Web sites could become available to applications. For example, the financial data that several specialized Web sites publish only in HTML could be continuously extracted and processed for mining purposes; data delivered on the Web by thematic communities could be extracted and integrated.

However, automatically building wrappers for a large number of Web sites presents several issues. First, the sample pages to feed the wrapper generation system must be selected, i.e., clusters of structurally homogeneous sample pages must be identified; this significantly affects the scalability of wrapper-based approaches, because sample pages are currently selected manually. Second, once a library of wrappers for a Web site has been generated, the correct wrapper must be selected for any given target page.

This paper addresses these issues: we present a system that automatically discovers the main classes of pages offered by a site by exploring a small yet representative portion of it. The system produces a model of the site consisting of clusters of pages. The model is suitable for wrapping, as the pages of each cluster exhibit the required structural uniformity.

We now describe the overall approach by means of an example. Consider the official FIFA 2002 World Cup Web site,1 whose roughly 20,000 pages contain information about teams, players, matches, and news. The site content is organized in a regular way; for example, we find one page for each player, one page for each team, and so on. These pages are well-structured: for instance, all the player pages share the same structure and, at the intensional level, they present similar information (the name of the player, his current club, a short biography, etc.); moreover, all team pages share a common structure and common intensional information, which differ from those of the player pages. Also, pages contain links to one another, in order to provide effective navigation paths that reflect semantic relationships; for example, every team page contains links to the pages of its players.

A key observation in our approach is that links reflect the regularity of the structure. Consider the Web page of Fig. 1, which is taken from the FIFA Web site and presents information about a national team. Observe that links are grouped in collections with uniform layout and presentation properties; we call these groups link collections.2 Usually, links in the same collection lead to similar pages. For example, the large table on the right side of the page contains links to player pages; the list located in the central part of the page has links to news pages. Also, we observe that these link collections appear in every team page. Similar properties hold for the pages in Fig. 2, which offer statistics about teams. Every stats page has a collection of links organized in the leftmost column of a large table; all these links point to player pages.

Based on these observations, we argue that:

  • it is reasonable to assume that links sharing layout and presentation properties usually point to pages that are structurally similar. In our example, in a team page, the large table on the right contains links that all point to player pages, while links in the central list lead to news pages;

  • the set of layout and presentation properties associated with the links of a page can be used to characterize the structure of the page itself. In other words, whenever two (or more) pages contain links that share the same layout and presentation properties, it is likely that the two pages share the same structure. In the FIFA example, if two pages contain a link collection inside a large table on the right and a collection of links inside a central list, we may assume that the two pages are similar in structure (they look like the team page in Fig. 1).


Our approach relies on the above observations. Pages are modelled in terms of the link collections they offer, and the similarity of a group of pages is measured with respect to these features. We have designed and implemented an algorithm that creates clusters of pages with homogeneous structure while crawling only a small portion of the site. The algorithm starts from an entry point, such as the home page, which forms the initial (singleton) page class, and then creates new clusters by iteratively exploring the outbound links. To minimize the number of pages to fetch, the algorithm exploits the properties of link collections. Pages reached from the same collection are assumed to form a uniform class of pages; at the same time, suitable techniques are applied to handle situations in which this assumption is violated.
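
To fix ideas, a page can be abstracted as the set of layout paths under which its links appear, and two pages can then be compared through the overlap of these sets. The following sketch is purely illustrative: the function names, the example paths and the Jaccard measure are our own simplifications, while the grouping criterion actually used by the algorithm is the MDL-based evaluation described in Section 3.

def page_fingerprint(root_to_link_paths):
    """Abstract a page as the set of distinct root-to-link paths of its links."""
    return set(root_to_link_paths)

def structural_overlap(paths_a, paths_b):
    """Fraction of link-collection paths shared by two pages (Jaccard index)."""
    a, b = page_fingerprint(paths_a), page_fingerprint(paths_b)
    if not (a or b):
        return 1.0
    return len(a & b) / len(a | b)

# Two hypothetical team pages from the running example: both expose a
# table of player links and a central list of news links, so they score high.
team_page_1 = ["html/body/table/tr/td/a", "html/body/div/ul/li/a"]
team_page_2 = ["html/body/table/tr/td/a", "html/body/div/ul/li/a",
               "html/body/div/p/a"]
print(structural_overlap(team_page_1, team_page_2))  # 0.666...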

The remainder of the paper is organized as follows. Section 2 illustrates our model in detail. Section 3 describes the algorithm for exploring the site and clustering pages according to their structure. Section 4 reports the results of experiments we have conducted on real-life Web sites. Section 5 discusses related work, and Section 6 concludes the paper.


Web site structure model

In this section we present our model to abstract the structure of a Web site, based on the main idea that layout and presentation properties associated with links can characterize the structure of a page.

For our purposes, a Web page is represented as a subset of the root-to-link paths in the corresponding DOM tree representation [1], along with the referenced URLs themselves.
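
As a minimal sketch of this abstraction, assuming plain tag paths (the actual model works on the full DOM tree [1] and may also take layout and presentation attributes into account, which are ignored here), root-to-link paths can be computed by tracking the open tags while parsing, and links sharing the same path can be grouped into candidate link collections:

from collections import defaultdict
from html.parser import HTMLParser

class LinkPathExtractor(HTMLParser):
    """Maps every <a href=...> to its root-to-link tag path and groups
    links sharing the same path into a candidate link collection."""

    VOID = {"br", "hr", "img", "input", "meta", "link"}  # tags without an end tag

    def __init__(self):
        super().__init__()
        self.stack = []                       # currently open tags, root first
        self.collections = defaultdict(list)  # path -> referenced URLs

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.collections["/".join(self.stack + ["a"])].append(href)
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

parser = LinkPathExtractor()
parser.feed("<html><body><table><tr>"
            "<td><a href='player1.html'>P1</a></td>"
            "<td><a href='player2.html'>P2</a></td>"
            "</tr></table></body></html>")
print(dict(parser.collections))
# {'html/body/table/tr/td/a': ['player1.html', 'player2.html']}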

Site-model generation algorithm

We have designed an algorithm that builds the site model incrementally, while crawling the site. The quality of the site model is evaluated with an information-theoretic approach based on the Minimum Description Length (MDL) principle [24], [16].
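
In general terms, the MDL principle prefers, among candidate site models, the one that minimizes a two-part code length

cost(M, D) = L(M) + L(D | M)

where L(M) is the number of bits needed to describe the model itself (the page classes and their link collections) and L(D | M) is the number of bits needed to describe the visited pages given the model. The specific encoding adopted by the algorithm is not reproduced in this excerpt.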

Fig. 5 presents the pseudo-code of the algorithm: the entry point to the site is a single given seed page, which becomes the first member of the first class in the model. Its link collections are extracted and pushed into a priority queue. Then, the
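
Although the figure is not reproduced in this excerpt, the loop described above can be summarized by the following simplified sketch. All helpers passed as parameters (fetch, extract_link_collections, priority, best_class_for), as well as the urls attribute of a link collection, are hypothetical placeholders; in particular, the acceptance test the algorithm actually performs is based on the MDL cost of the resulting model.

import heapq
import itertools

def build_site_model(seed_url, fetch, extract_link_collections,
                     priority, best_class_for,
                     max_pages=500, sample_size=5):
    """Simplified sketch of the incremental crawl-and-cluster loop."""
    tie = itertools.count()        # breaks priority ties in the heap
    seed_page = fetch(seed_url)    # hypothetical fetch-and-parse helper
    model = [[seed_page]]          # the seed forms the first (singleton) class
    frontier = []
    for c in extract_link_collections(seed_page):
        heapq.heappush(frontier, (priority(c), next(tie), c))
    visited = {seed_url}

    while frontier and len(visited) < max_pages:
        _, _, collection = heapq.heappop(frontier)
        # Links in one collection are expected to lead to similar pages,
        # so only a small sample of them is actually fetched.
        urls = [u for u in collection.urls if u not in visited][:sample_size]
        sample = [fetch(u) for u in urls]
        visited.update(urls)
        if not sample:
            continue
        target = best_class_for(sample, model)  # e.g. the class whose cost grows least
        if target is None:
            model.append(sample)                # no existing class fits: open a new one
        else:
            target.extend(sample)
        for page in sample:
            for c in extract_link_collections(page):
                heapq.heappush(frontier, (priority(c), next(tie), c))
    return model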

Experiments

A working prototype of the algorithm has been used to conduct several experiments. To tune the algorithm parameters, we ran the system on two test sites with known structural properties. The first test site was hand-crafted with an ad hoc structure; we used it for early experiments specifically designed to test the algorithm implementation. The second site is the Web site of the teaching activities of our Department:8 starting

Related work

The issue of modelling the logical structure of Web sites for extraction purposes has been studied in several research projects. A pioneering approach is proposed in the Araneus project [3], [2], where a Web page is considered an object with an identifier (the URL) and a set of attributes. The notion of page scheme is then introduced to model sets of homogeneous pages. Attributes of a page may have a simple or complex type. Simple attributes correspond to text, images or links to other pages;

Conclusions and further work

In this paper we have presented an algorithm to cluster pages from a data-intensive Web site, based on the page structure. The structural similarity among pages is defined with respect to their DOM trees. The algorithm identifies the main classes of pages offered by the site by visiting a small yet representative number of pages. The resulting clustering can be used to build a model that describes the structure of the site in terms of classes of pages and links among them.

The model can be used


References (29)

  • Document Object Model (DOM) Level 1 specification, W3C Recommendation, http://www.w3.org/TR/REC-DOM-level-1 (October...
  • P. Atzeni et al., Managing web-based data: Database models and transformations, IEEE Internet Computing (2002)
  • P. Atzeni, G. Mecca, P. Merialdo, To Weave the Web, in: Proceedings of 23rd International Conference on Very Large Data...
  • G.O. Arocena et al., WebOQL: Restructuring documents, databases, and webs, TAPOS—Theory and Practice of Object Systems (1999)
  • Z. Bar-Yossef, S. Rajagopalan, Template detection via data mining and its applications, in: Proceedings of the 11th...
  • S. Chakrabarti et al., Mining the web’s link structure, Computer (1999)
  • S. Chakrabarti et al., Focused crawling: a new approach to topic-specific Web resource discovery, Computer Networks (Amsterdam, Netherlands) (1999)
  • C.-H. Chang, S.-C. Lui, IEPAD: information extraction based on pattern discovery, in: Proceedings of the Tenth...
  • V. Crescenzi, G. Mecca, P. Merialdo, RoadRunner: Towards automatic data extraction from large Web sites, in:...
  • V. Crescenzi, G. Mecca, P. Merialdo, Wrapping-oriented classification of web pages, in: Proceedings of the ACM...
  • V. Crescenzi, P. Merialdo, P. Missier, Fine-grain web site structure discovery, in: Proceedings of the 5th ACM...
  • J. Dean et al., Finding related pages in the world wide web, Computer Networks (1999)
  • S. Flesca, G. Manco, E. Masciari, L. Pontieri, A. Pugliese, Detecting structural similarities between xml documents,...
  • M. Garofalakis, A. Gionis, R. Rastogi, S. Seshadri, K. Shim, XTRACT: A system for extracting document type descriptors...
Valter Crescenzi received his Laurea degree in Computer Engineering from Università Roma Tre, Italy in 1998, and his Ph.D. degree from Università di Roma La Sapienza, Italy in 2002. He is currently a research assistant at Università Roma Tre. His research interests focus on information extraction from Web data sources.

Paolo Merialdo received his Laurea degree in Computer Engineering from Università di Genova, Italy in 1990, and his Ph.D. degree from the Università di Roma “La Sapienza”, Italy in 1998. He is currently a research assistant at Università Roma Tre, Italy. His research interests focus on methods, models and tools for the management of data for Web-based information systems and, more recently, on information extraction from Web data sources.

Paolo Missier has extensive research and industrial experience in the area of data management and software architectures. He worked as a Research Scientist at Telcordia Technologies (formerly Bellcore), NJ, USA for eight years and later as an independent collaborator on multiple research projects with Università Roma Tre in Rome and Università Milano Bicocca in Italy, where he has also been teaching. He currently works as a researcher on data management issues for bioinformatics at the Department of Computer Science of the University of Manchester, UK. He received an M.Sc. in Computer Science from the University of Houston, TX, USA in 1993 and a B.Sc. and M.Sc. in Computer Science from Università di Udine, Italy in 1990. He has been an ACM and SIGKDD member since 1998 and an IEEE member since 2001.
