Schemas for web data: a reverse engineering approach

https://doi.org/10.1016/S0169-023X(01)00036-2

Abstract

In this paper, we show how to generate schemas for a set of HTML or XML documents retrieved from the web in the context of our web warehousing system called Whoweda (WareHouse Of WEb DAta). Web schemas are used to bind a web table that contains a collection of interlinked web documents called web tuples. These schemas specify the metadata, content and structural properties (in the form of predicates) shared by the web documents and hyperlinks in the web table. They also summarize the hyperlink structure of these documents using the notion of connectivities. Web schemas are generated in three stages. In the first stage, a simple or complex web schema is generated from the user's query (coupling query). In the next stage, the complex web schema is decomposed into a set of simple web schemas. These two stages are performed without inspecting the data instances, i.e., web tuples. Finally, in the last stage, the set of simple web schemas is pruned by inspecting the hyperlink structure of the web tuples. We also discuss the formal algorithm for generating a set of simple web schemas from a coupling query.

Introduction

The exponential growth of the web in the last few years has had a significant impact on the traditional data management techniques developed over the last few decades. This has compelled the database community to reuse traditional techniques wherever possible to manage web data. Unfortunately, due to the very nature of web data, it is not always possible to reuse conventional techniques effectively. This has led the database community to rethink existing techniques and reuse them in new ways to address the current challenges. In this paper, we describe a novel technique for generating schemas of web data. We introduce the notion of a web schema to model instances of warehouse data and show how it is generated in the context of our web warehousing system, called Whoweda (Warehouse Of Web Data) [3]. As will be seen, this issue is more challenging than the corresponding problem for relational schemas due to the irregularity and incompleteness of data on the World Wide Web. Beyond its use to define the structure of a set of data in the warehouse, a web schema serves two important functions: it helps the user in query formulation and aids the query processor in executing queries efficiently [4].

There has been increasing research activity in generating schemas for semistructured data [2], [6], [7], [9], [10]. For instance, in [7], [8], the authors provide a structural summary that allows a semistructured database system (or a user of one) to quickly extract information about label paths in the database. In [10], work is presented on the extraction of implicit structure in semistructured data modeled, in the style of [1], as a directed, labeled graph. Our approach differs from these works in the following ways.

Traditionally, a schema provides a structural summary of the data it binds. Query formulation and evaluation can be performed efficiently if some of the content and metadata properties shared by the web documents and hyperlinks are highlighted in the schema. HTML tags do not describe the data segments they enclose, and hence a structural summary of HTML pages is not very useful in subsequent query evaluation and formulation. Furthermore, capturing a summary of the hyperlink structure of a set of web documents also helps us to formulate meaningful queries in the warehouse. Consequently, a web schema provides two types of information. First, it specifies some of the common properties shared by the documents and hyperlinks in the web table with respect to not only their structure, but also their metadata and content. Second, it summarizes the hyperlink structure of these documents. For instance, given a set of documents, the web schema(s) may specify that the titles of all these documents contain the keyword “genetic disorder”. It may also specify that these documents belong to the web site at www.ninds.nih.gov and contain the tags symptom, treatment and drugs inside the tag disease. Also, a web schema may specify that a set of documents containing the keyword “genetic disorder” is directly linked to a set of documents having the tags drugs and side effects via a set of hyperlinks whose labels contain the keyword “drugs”. In Section 3.1, we describe how the content, metadata and structural summary is incorporated in a web schema.
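The example in the preceding paragraph can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the classes Doc, Link and WebSchema, the variable names x, e and y, and the sample URLs are hypothetical and are not Whoweda's data model or notation. The sketch represents predicates on documents and links as Boolean functions and a connectivity as a (source, link, target) triple, then checks whether one assignment of documents and links satisfies the schema.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple
from urllib.parse import urlparse


# Hypothetical, simplified records; not Whoweda's actual data model.
@dataclass
class Doc:
    url: str
    title: str
    tags: List[str]


@dataclass
class Link:
    source: str  # URL of the source document
    target: str  # URL of the target document
    label: str


@dataclass
class WebSchema:
    # Predicates keyed by a node or link variable name (names are assumptions).
    node_predicates: Dict[str, List[Callable[[Doc], bool]]] = field(default_factory=dict)
    link_predicates: Dict[str, List[Callable[[Link], bool]]] = field(default_factory=dict)
    # Each connectivity is (source node variable, link variable, target node variable).
    connectivities: List[Tuple[str, str, str]] = field(default_factory=list)

    def binds(self, docs: Dict[str, Doc], links: Dict[str, Link]) -> bool:
        """Return True if the assignment satisfies every predicate and connectivity."""
        nodes_ok = all(
            var in docs and all(p(docs[var]) for p in preds)
            for var, preds in self.node_predicates.items()
        )
        links_ok = all(
            var in links and all(p(links[var]) for p in preds)
            for var, preds in self.link_predicates.items()
        )
        connected = all(
            x in docs and e in links and y in docs
            and links[e].source == docs[x].url
            and links[e].target == docs[y].url
            for x, e, y in self.connectivities
        )
        return nodes_ok and links_ok and connected


# The schema from the running example: documents x about "genetic disorder" on
# www.ninds.nih.gov link to documents y tagged drugs and side effects via links
# e whose label contains "drugs".
schema = WebSchema(
    node_predicates={
        "x": [
            lambda d: "genetic disorder" in d.title.lower(),
            lambda d: urlparse(d.url).netloc == "www.ninds.nih.gov",
        ],
        "y": [lambda d: {"drugs", "side effects"} <= set(d.tags)],
    },
    link_predicates={"e": [lambda l: "drugs" in l.label.lower()]},
    connectivities=[("x", "e", "y")],
)

# Made-up sample data, used only to exercise the check.
docs = {
    "x": Doc("http://www.ninds.nih.gov/disorders/a.html",
             "Overview of a genetic disorder", ["disease", "symptom", "treatment"]),
    "y": Doc("http://www.ninds.nih.gov/drugs/b.html",
             "Drug list", ["drugs", "side effects"]),
}
links = {"e": Link(docs["x"].url, docs["y"].url, "Drugs used in treatment")}

result = schema.binds(docs, links)
```

Keeping predicates and connectivities in separate fields mirrors the two kinds of information the paragraph attributes to a web schema: shared properties of documents and links, and a summary of how they are linked.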

We take a reverse approach in generating a web schema. The standard database paradigm in schema generation involves first creating a schema to describe the structure of the database and then populating that database through the interface provided by the schema. The schema is then used to decide whether some new data fits the schema or whether a query is legal against the set of data. Hence, a schema is defined before the query. In our approach, a web schema is derived from a coupling query. A coupling query is specified by a user and is used to populate the web warehouse by retrieving relevant data from the web that matches the query. The result of such a query is a set of directed graphs called web tuples, which are stored in a web table. We justify this reverse approach now. If a web schema is defined by a user ahead of time, the structure, content and rigidity of the web schema depend on two factors: first, the information the user wishes to retrieve from the web; second, the user's level of knowledge of the content and structure of the web site(s) containing the relevant data. However, this conventional approach is not feasible for the following reasons. First, it is unrealistic to assume that the user has complete knowledge of the content and structure of web pages in web sites. Users know what they are looking for, but they may not have a clear idea of how to express their needs. This is primarily because one may not necessarily know anything about the architecture of a given site, the structure used to represent the data, or how the desired information is represented in the source. Thus, a web schema generated by a user may contain a very loose structure. Such a schema does not represent the best schema for the data it binds. Second, even if a user succeeds in finding the schema of a set of documents by browsing the web site(s) manually, this is not a realistic approach. For a reasonably small set of web documents, browsing the documents for schema information is a feasible option. However, browsing a document set of significant size is a tedious and inefficient way of determining the schema of the data. This problem may be minimized by specifying a coupling query based on partial knowledge of the data of interest and then generating the schema of these data autonomously from the query and the query results (web tuples). We discuss this approach to schema generation in Sections 4 (Phase 1: valid canonical query to schema transformation), 5 (Phase 2: complex schema decomposition) and 6 (Phase 3: schema pruning).

Due to the unstructured and irregular nature of the web, there may exist a set of documents or links in the query result (web tuples) whose common characteristics are not known ahead of time, or that simply may not share any common properties. Moreover, there may also exist a collection of documents and links that do not share any common characteristics with respect to their connectivities with one another or with other documents. Such documents and links are called free documents and links. Our web schema is flexible enough to represent these documents and links and how they are connected to one another. We discuss this in detail in Section 3.1.

The dynamic nature of the web further aggravates the problem of using traditional schema generation techniques. As new web data is added frequently to its sources, a schema may become incomplete or inconsistent. Consequently, DataGuides [7], [8] are recomputed, or incrementally updated, when the data changes. In Whoweda, a set of web schemas is associated with each web table. These schemas are not incrementally updated, as the web tables are not modified [4]. We explain the reasons for this in Section 3.3. Hence, the maintenance of web schemas is less complicated than that of DataGuides.

Recent approaches to schema generation for semistructured data [2], [6], [7], [9], [10] focus on generating schemas for XML-like documents. These approaches are useful for documents having user-defined tags. However, as mentioned earlier, the structural summary provided by these approaches may not always be helpful for HTML documents. Our notion of web schema is generic; that is, it can be used for both HTML and XML documents.

The rest of the paper is organized as follows. In Section 2, we provide the framework for our subsequent discussion of schema generation. We formally introduce the notion of a coupling query and illustrate two types of coupling queries, i.e., canonical and non-canonical queries, with examples. We formally introduce the concept of a web schema in Section 3. Sections 4 (Phase 1: valid canonical query to schema transformation), 5 (Phase 2: complex schema decomposition) and 6 (Phase 3: schema pruning) describe the generation of web schemas from coupling queries. Section 7 describes the formal algorithm for generating a simple schema set. We discuss our work with respect to existing work on schemas for semistructured data in Section 8. Section 9 concludes the paper. For brevity, a schema will mean a web schema (unless explicitly stated otherwise).

Framework

In order to populate the web warehouse, it is necessary to retrieve relevant documents from the web. This is performed in Whoweda by posing a coupling query over the web. This query is evaluated by the global web coupling operation [3], and a set of documents satisfying the query is gathered from the web. In this section, we introduce the notion of the coupling query. We begin by introducing the underlying data model of Whoweda. Then, we describe the components of a coupling query for expressing

Web schema

We are now ready to discuss the notion of a web schema in detail. We begin by discussing the components of a web schema. Next, we categorize web schemas into two types, i.e., simple and complex web schemas. Then, we describe the notion of a web table. Finally, we conclude this section by providing an overview of the schema generation process.

Phase 1: valid canonical query to schema transformation

In this section, we discuss the first phase of the schema generation process. Specifically, we show how to create the components of a simple or complex web schema from a valid canonical coupling query. We call this process the canonical query to schema transformation phase. Hereafter, a coupling query will mean a valid canonical coupling query (unless explicitly stated otherwise). Intuitively, the components Xn, X, C and P of the schema are generated from the corresponding components of the

Phase 2: complex schema decomposition

In this section, we discuss the second phase of the schema generation process, that is, how to decompose a complex web schema into a set of simple web schemas. We call this the complex schema decomposition phase. We begin by motivating this phase. Next, we discuss the process of generating a set of simple web schemas from a complex web schema. Finally, we identify the limitations of this phase.
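The text available here does not say how a complex web schema is represented, so the following Python sketch rests on an assumption made purely for illustration: that the connectivities of a complex schema can be written as a conjunction of groups, where each group lists alternative connectivities. Under that assumption, decomposition amounts to choosing one alternative from each group. The function name decompose and the variable names are hypothetical and do not come from the paper.

```python
from itertools import product
from typing import List, Tuple

# A connectivity as (source node variable, link variable, target node variable).
# This triple encoding is an assumption made for the sketch.
Connectivity = Tuple[str, str, str]


def decompose(groups: List[List[Connectivity]]) -> List[List[Connectivity]]:
    """Expand a conjunction of alternative groups into purely conjunctive sets.

    Each inner list holds alternatives (read as OR); the outer list is read as
    AND. The result contains one connectivity set per combination of choices.
    """
    return [list(choice) for choice in product(*groups)]


# Hypothetical complex schema: x links to y via e, AND
# (y links to z via f OR y links to w via g).
complex_connectivities = [
    [("x", "e", "y")],
    [("y", "f", "z"), ("y", "g", "w")],
]

simple_sets = decompose(complex_connectivities)
```

In this sketch the number of resulting sets is the product of the group sizes, so it can grow quickly as alternatives are added.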

Phase 3: Schema pruning

In this section, we discuss the final phase of schema generation, i.e., schema pruning. We begin by discussing the objectives of the schema pruning process. Next, we introduce some terms related to web schemas that we will use to explain the schema pruning process. Then, in Section 6.3, we informally introduce the schema pruning operation with a few examples. Finally, Sections 6.4 (Phase 1: pre-processing phase), 6.5 (Phase 2: matching phase) and 6.6 (Phase 3: non-overlapping partitioning phase)

The complete algorithm – algorithm Schema Generator

We now provide the algorithm for generating a set of simple web schemas to bind the web tuples generated by the global web coupling operation. We focus on the high-level procedures of the algorithm, deferring detailed discussion to the later portion of the section that presents our techniques in detail. Fig. 5 provides the pseudo-code for the algorithm. It takes as input a canonical coupling query G, the name of the web table N and a set of web tuples retrieved from the web (denoted as tupleSet in

Recent approaches in schema generation

In this section, we discuss some of the recent work in the area of schema generation for semistructured data and compare our web schema with these efforts. The authors of [6] discuss schemas for graph-structured databases. A formal definition of a graph schema is given, along with an algorithm to determine whether a database conforms to a specific schema. This work presents a more traditional view of a schema than the one we take. Schema information for semistructured data translation is

Conclusions

In this paper, we have formally introduced the notion of web schemas and described how they are generated in Whoweda. As ongoing work we are looking into the following issues. (1) We are developing techniques to measure the goodness of a web schema. Schema goodness measures the quality of a schema with respect to the set of web tuples it binds. Note that by quality we mean how much information related to the metadata, content and structure of the node and link objects in the tuple set is

Sourav S. Bhowmick received his Ph.D. in Computer Engineering from Nanyang Technological University, Singapore in 2001. He is an Assistant Professor in the School of Computer Engineering at the Nanyang Technological University. He has published more than 25 journal and conference papers in the areas of web warehousing, web mining and mobile data management. He serves as a PC member of various database conferences and workshops and as a reviewer for various journals.

References (10)

  • S. Abiteboul et al., The Lorel query language for semistructured data, Journal of Digital Libraries (1997).
  • C. Beeri et al., Schemas for integration and translation of structured and semi-structured data, Proceedings of the International Conference on Database Theory (1999).
  • S.S. Bhowmick, WHOM: A Data Model and Algebra for a Web Warehouse. Ph.D. Dissertation, School of Computer Engineering,...
  • S.S. Bhowmick et al., Web schemas in Whoweda, 3rd ACM International Workshop on Data Warehousing and OLAP (DOLAP'00) (in conjunction with ACM CIKM '00), Washington, DC (November 2000).
  • S.S. Bhowmick et al., Imposing disjunctive constraints on inter-document structure using coupling queries, Proceedings of the 12th International Conference on Database and Expert Systems (DEXA'01), Munich (September 2001).
There are more references available in the full text version of this article.

Wee Keong Ng is an Assistant Professor in the School of Computer Engineering at the Nanyang Technological University, Singapore. He obtained his M.Sc. and Ph.D. degrees from the University of Michigan, Ann Arbor in 1994 and 1996, respectively. He works and publishes widely in the areas of web warehousing, information extraction, electronic commerce and data mining. He has organized and chaired international workshops, including tutorials, and has actively served on the program committees of numerous international conferences. He is a member of the ACM and IEEE Computer Society.

Sanjay Kumar Madria received his Ph.D. in Computer Science from the Indian Institute of Technology, Delhi, India in 1995. He is an Assistant Professor in the Department of Computer Science at the University of Missouri-Rolla, USA. Earlier, he was a Visiting Assistant Professor in the Department of Computer Science, Purdue University, West Lafayette, USA. He has also held appointments at Nanyang Technological University in Singapore and University Sains Malaysia in Malaysia. He has published more than 50 journal and conference papers in the areas of web warehousing, mobile databases, data warehousing, nested transaction management and performance issues. He is guest editor of the WWW Journal and Data and Knowledge Engineering for special issues on web data management and data warehousing. He is Program Chair for the EC&WEB 2001 conference to be held in Germany in Sept. 2001. He was PC chair of ECWEB'00 and workshop chair for the “Internet Data Management” workshop held in Florence, Italy in Sept. 1999. He serves as a PC member of various database conferences and workshops and as a reviewer for many reputed database journals. Dr. Madria has given tutorials on web warehousing and mobile databases at many international conferences. He is a regular panelist for NSF. He was an invited keynote speaker at the Annual Computing Congress in October '99. He is an IEEE Senior Member.
