Elsevier

Information Systems

Volume 35, Issue 6, September 2010, Pages 688-707

Adaptive relaxation for querying heterogeneous XML data sources

https://doi.org/10.1016/j.is.2010.02.002

Abstract

Searching XML data with a structured XML query can improve the precision of results compared with a keyword search. However, the structural heterogeneity of the large number of XML data sources makes it difficult to answer a structured query exactly. As such, query relaxation is necessary. Previous work on XML query relaxation suffers from the unnecessary computation of a large number of unqualified relaxed queries. To address this issue, we propose an adaptive relaxation approach that relaxes a query differently against different data sources, based on the schemas to which they conform. In this paper, we present a set of techniques that support this approach: schema-aware relaxation rules for relaxing a query adaptively, a weighted model for ranking relaxed queries, and algorithms for adaptive relaxation of a query and top-k query processing. We discuss results from a comprehensive set of experiments that show the effectiveness and efficiency of our approach.

Introduction

As XML becomes the standard for representing web data, people are publishing a large volume of data on the internet using XML for various purposes. For example, universities publish their course and research data to attract students; travel and real estate agents publish their flight and property data for advertisement; stock brokers and car dealers publish stock and car information for online business; public service providers publish data such as tourist attractions and publications to provide information. As such, there is an increasing need to search and query XML data. Compared with a keyword search, a structured XML query allows a user to formulate search requests more precisely. However, the structural heterogeneity of the potentially large number of XML data sources makes it difficult to answer a structured query exactly. The loosely coupled nature of the data sources also makes it impractical to deploy the traditional federated database approach, which integrates the XML data sources by defining a global schema. Ideally, a query would be relaxed intelligently and then answered according to the data sources against which it is issued.

Amer-Yahia et al. [1], [2] proposed FleXPath, a framework for relaxing XML tree pattern queries (TPQs). Given a TPQ q, the closure of the structural and value-based predicates in q is first inferred and then used to generate relaxed queries. The set of generated queries, including the one that contains only the root of q, covers all possible relaxed queries. However, the relaxation process is essentially blind and wild, and the number of relaxed queries can be large. For a large number of heterogeneous XML data sources, many of the generated relaxed queries could be unqualified and incur the unnecessary cost of either computing or testing them.

As an example, we may issue a query against the XML data sources maintained by all Australian universities to search for those departments that have a group running a project whose name contains “xml” and having publications whose title contains “query relaxation”. As the number of universities is large and their data source structures may vary, users normally formulate their queries against a domain schema that reflects the common understanding of a university. Here a domain schema is similar to a global schema. However, unlike in a federated database, such a domain schema and its mapping from data source schemas may not be physically defined, so we cannot borrow the global-to-local query rewriting techniques used in data integration [4], [5]. Fig. 1 shows the query q represented as a TPQ; it reflects the user's structural and value-based search requirements. Based solely on the query itself, FleXPath may need to consider 25 options, each of which could be a relaxed query that may be executed or tested against the university data sources. Some generated relaxed queries may be too blind for some data sources and thus return zero answers, or too wild and thus return answers that are far from what a user expects. For example, the partial structures of the data sources s1 and s2 for two universities are shown in schema d1 in Fig. 2 and schema d2 in Fig. 3, respectively. Obviously, the query itself will not return any result, and many relaxed queries will be generated by FleXPath from the 25 options and then evaluated or tested against both data sources. Some of the relaxed queries that are useless for s2 are listed in Fig. 4. In fact, q2 and q3 in Fig. 4 are also useless for s1.

To deal with this problem, we propose an adaptive query relaxation (AQR) approach, which relaxes a query adaptively for each XML data source according to the schema to which it conforms. Hence each relaxed query is guaranteed to agree with the structural constraints imposed by the schema of the data source and, as a result, has a higher probability of generating answers than a query relaxed by FleXPath. For example, for schema d1 in Fig. 2 and schema d2 in Fig. 3, the relaxed queries generated by AQR are shown in Fig. 5 and Fig. 6, respectively.

AQR avoids blind relaxation. Each relaxed query generated for an XML data source is specific to that data source. In other words, a relaxed query that does not satisfy the structural constraints imposed by the conformed schema will not be generated. This is similar to semantic query optimisation, where a query that contradicts an integrity constraint defined in the underlying schema need not be evaluated. For example, for data source s2, query q1 in Fig. 4 is useless and will not return any result because the edge between group and project in q1 does not match d2.
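The structural check described above can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the schema is a hypothetical toy one (it is not d2 from Fig. 3, which is not reproduced here) that merely mimics a DTD in which project is not a child of group, and the names `TOY_DTD`, `unmatched_edges` and `is_structurally_valid` are assumptions made for this example.

```python
# Hypothetical toy schema: parent element -> set of allowed child elements.
# It is NOT the d2 of Fig. 3; it only mimics a schema in which
# "project" is not a child of "group".
TOY_DTD = {
    "department": {"group", "project"},
    "group": {"publication"},
    "project": {"pname"},
    "publication": {"title"},
}


def unmatched_edges(query_edges, dtd):
    """Return the parent-child query edges that the schema does not allow."""
    return [(parent, child) for parent, child in query_edges
            if child not in dtd.get(parent, set())]


def is_structurally_valid(query_edges, dtd):
    """A candidate relaxed query is kept only if every edge matches the schema."""
    return not unmatched_edges(query_edges, dtd)


# A "blind" candidate that still places project under group.
blind_query = [("department", "group"), ("group", "project"), ("project", "pname")]
# A candidate whose edges all agree with the toy schema.
adapted_query = [("department", "group"), ("department", "project"), ("project", "pname")]

print(unmatched_edges(blind_query, TOY_DTD))
print(is_structurally_valid(adapted_query, TOY_DTD))
```

Under these assumptions, a candidate with a non-empty list of unmatched edges would simply never be emitted for that data source, which is the sense in which blind relaxation is avoided.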

AQR also avoids wild relaxation. No unnecessary relaxation is needed because a data source has to conform to its schema. For example, the *-node project in d2 implies the co-existence of project and pname. As such, for s2, query q2 in Fig. 4 is too wild compared with query q21 in Fig. 6. In other words, after q21 is generated and evaluated, the time spent on generating and computing q2 is unnecessary because no new result will be returned.

As a large number of data sources may be evaluated and the relaxed queries generated for these data sources may differ, it is desirable to first execute the relaxed query that is closest to the original query, so that the most relevant results are returned first. This is especially important for evaluating a top-k query [3], [6]. For example, query q11 in Fig. 5 is the closest to the original query q among all the generated relaxed queries in Fig. 5 and Fig. 6. In AQR, a top-k query is processed by incrementally evaluating the relaxed queries in their ranking order. Instead of ranking returned results, we rank the relaxed queries. To measure how close a relaxed query is to the original query, we propose a penalty-based ranking model that measures the difference between the two. For example, if the penalty for relaxing “/” to “//” is 0.1, we can compute the penalties of q11 and q12 in Fig. 5 as 2 and 5, and the penalties of q21 and q22 in Fig. 6 as 2.3 and 5.3. So q11 is the least penalized query, i.e., the closest query to the original query q. The details for computing the penalties can be found in Section 4.
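The idea of evaluating relaxed queries in ranking order until k answers are collected can be sketched as below. This is a simplified illustration under stated assumptions, not the scheduling strategy of Section 6: the penalties are the ones quoted above, while the answer sets in `HYPOTHETICAL_ANSWERS` and the helper names `evaluate_stub` and `top_k` are invented purely for demonstration.

```python
# Penalties quoted in the text for the relaxed queries of Fig. 5 and Fig. 6.
PENALTIES = {"q11": 2, "q12": 5, "q21": 2.3, "q22": 5.3}

# Hypothetical answers per relaxed query (placeholders, not experimental data).
HYPOTHETICAL_ANSWERS = {
    "q11": ["a1"],
    "q12": ["a1", "a2"],
    "q21": ["b1", "b2"],
    "q22": ["b3"],
}


def evaluate_stub(query_name):
    """Stand-in for running a relaxed query against its data source."""
    return HYPOTHETICAL_ANSWERS[query_name]


def top_k(penalties, evaluate, k):
    """Evaluate relaxed queries in increasing penalty order until k answers are found."""
    answers = []
    for name in sorted(penalties, key=penalties.get):
        for answer in evaluate(name):
            if answer not in answers:  # skip answers already returned
                answers.append(answer)
            if len(answers) == k:
                return answers  # stop early: remaining queries are never run
    return answers


print(top_k(PENALTIES, evaluate_stub, 3))
```

With k = 3 in this toy setting, only q11 and q21 are evaluated; the more heavily penalized q12 and q22 are never executed, which is the benefit of ranking queries rather than results.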

To improve the accuracy and relevance of the results, we allow a user to specify weights on the edges of a query and thus incorporate weights into the ranking model. If the relationship between two nodes is less important than others, a weight smaller than the maximum weight 1 may be specified. Our ranking model is based on the weights set on the original query and the penalty derived for a relaxed query. The ranking score of a relaxed query is calculated as the difference between the query weight of the original query and the penalty of the relaxed query. For example, when the weights of all edges are set to 1, the weight of the original query is 11. Since the penalties for q11, q12, q21 and q22 are 2, 5, 2.3 and 5.3, their scores are 9, 6, 8.7 and 5.7, respectively, i.e., the ranking list is [q11, q21, q12, q22]. If a user considers the relationship of a project under a group less important and decreases the weight of the edge between group and project in q to 0.5 while keeping the other edges at 1, the ranking list changes to [q21, q11, q22, q12]. The details for computing these ranking scores can also be found in Section 4.
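The arithmetic of the first example (all edge weights equal to 1) can be reproduced with a short sketch. It uses only the numbers quoted above; the function names `ranking_scores` and `ranking_list` are assumptions for illustration. The second example, with the group-project edge weight lowered to 0.5, is not reproduced because it depends on the penalty computation of Section 4.

```python
# Values quoted in the text: query weight of the original query when every
# edge weight is 1, and the penalties of the four relaxed queries.
ORIGINAL_QUERY_WEIGHT = 11
PENALTIES = {"q11": 2, "q12": 5, "q21": 2.3, "q22": 5.3}


def ranking_scores(original_weight, penalties):
    """Score of a relaxed query = weight of the original query - its penalty."""
    return {name: original_weight - penalty for name, penalty in penalties.items()}


def ranking_list(scores):
    """Relaxed queries ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


scores = ranking_scores(ORIGINAL_QUERY_WEIGHT, PENALTIES)
for name in ranking_list(scores):
    # Round for display only; floating-point subtraction may not be exact.
    print(name, round(scores[name], 1))
```

The resulting order matches the ranking list [q11, q21, q12, q22] stated above.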

In case the schema is not available for a data source, its structural information can be generated dynamically with data summarization tools [7], [8]. In this paper, without loss of generality, we take DTD as the schema of XML data.

In summary, we claim the following contributions in this paper:

  • We propose and formalize the adaptive XML query relaxation problem w.r.t. different DTDs and devise a set of schema-aware relaxation rules.

  • We develop a weight modification and penalty evaluation model to assess to what extent the original query is relaxed.

  • We design a set of algorithms to describe how the rules and penalty model are leveraged in the process of relaxing queries.

  • We provide a scheduling strategy to incrementally evaluate relaxed queries across multiple data sources.

  • We run extensive experiments on the XMark benchmark to demonstrate the efficiency and validity of our adaptive relaxation approach.

The rest of the paper is organized as follows. We give an overview of AQR in Section 2. Section 3 discusses the relaxation rules in detail. Section 4 provides our weight modification and penalty evaluation models. Section 5 describes our adaptive relaxation algorithm in detail. Section 6 proposes strategies to schedule the relaxed queries during query evaluation. We present the results of extensive experiments in Section 7. A brief survey of related work and the conclusions of this work are given in Sections 8 and 9, respectively.

Section snippets

Overview

The goal of query relaxation is to relax the query constraints such that approximate answers can be returned if the original query returns no answer or not enough answers. This is especially useful when we query a large number of heterogeneous XML data sources using a single structured XML query. Given a TPQ q, FleXPath generates relaxed queries by enumerating all possible combinations starting from q itself to the root of q, thus resulting in a large number of relaxed queries. Among these queries,

Adaptive relaxation rules

As discussed above, AQR relaxes a query for an XML data source based on its conformed DTD. Basically, AQR avoids blind relaxation by filtering out those query nodes that do not appear in the DTD and adjusting the node relationships if they do not match the DTD; AQR also avoids wild relaxation by preserving the query requirements which are definitely satisfied by the DTD. Before we introduce the set of adaptive relaxation rules for these purposes, we need the following definitions.

Definition 3 Corresponding node

Let a WTPQ q =

Weight and penalty

In order to improve the precision of a user-specified query, we allow users to assign weights to edges in the query to show their preferences for different paths. The default weight 1 is used if users do not express preferences. The weight information serves as a foundation for associating each relaxed query with a reasonable penalty. A less modified query with a low penalty is expected to capture the user's original query intent more accurately.

Adaptive relaxation process

Given a WTPQ q and a certain DTD di, we can relax q and generate a set of relaxed queries Qi using the relaxation rules discussed in Section 3. To get Qi, we can first relax q and generate a relaxed query qi that preserves maximum query requirements of q w.r.t. di. In other words, qi receives the minimal penalty w.r.t. di. Based on qi, we can then generate other relaxed queries in Qi according to the cardinality and disjunctive information provided in di. In this section, we introduce the

Top-k query evaluation

Given a WTPQ q and a set of heterogeneous data sources s1, s2, …, sn conforming to a set of DTDs d1, d2, …, dn, respectively, we can use the relaxation algorithm introduced in Section 5 to generate a set of relaxed queries q1, q2, …, qn together with their query weights we(q1), we(q2), …, we(qn). We denote these query weights as score(q1), score(q2), …, score(qn) in this section. Each qi (1 ≤ i ≤ n) is the relaxed query that preserves the maximum query requirements of q w.r.t. di and serves as a

Experiments

We ran the experiments on an Intel P4 3 GHz PC with 512 MB of memory. Wutka DTDparser [12] was used to analyze the source DTDs and extract their structural information. All relaxed queries were evaluated as XPath patterns in Oracle Berkeley DB XML [13].

Dataset and queries: We used the XMark XML data generator [14] to create a set of XML documents of different sizes, from 5 to 40 MB, which conform to auction.dtd [15]. These XML documents can be used to test the efficiency of AQR. In order to compare the

Related work

Query relaxations on structure have been studied recently. Some approaches propose to relax queries that return no result. Delobel and Rousset [16] define three kinds of relaxations: unfolding a node (replicating a node by creating a separate path to one of its children), deleting a condition at a node, and propagating a condition to its parent node. Schlieder [17] considers relaxations on an XQL query: deleting nodes for making the context loose, inserting a node between inner nodes for

Conclusions

In this paper, we presented a novel query relaxation approach, AQR, that adaptively relaxes a query for different XML data sources based on their conformed DTDs and users’ intentions. A set of schema-aware relaxation rules was designed, and a pertinent penalty model based on weight modification was developed. Adaptive relaxation algorithms and strategies for top-k query evaluation were implemented and illustrated through a comprehensive set of experiments to show the effectiveness and efficiency

Acknowledgements

We would like to thank the anonymous reviewers for their helpful comments on this article. This work was supported partly by the Australian Research Council Discovery Project under Grant no. DP0878405 and the Research Grant Council of the Hong Kong SAR, China under Grant no. 419109.

References (32)

  • S. Amer-Yahia, S. Cho, D. Srivastava, Tree pattern relaxation, in: EDBT, 2002, pp....
  • S. Amer-Yahia, L.V.S. Lakshmanan, S. Pandit, FleXPath: flexible structure and full-text querying for XML, in: SIGMOD,...
  • A. Marian, S. Amer-Yahia, N. Koudas, D. Srivastava, Adaptive processing of top-K queries in XML, in: ICDE, 2005, pp....
  • A.Y. Halevy, Answering queries using views: a survey, VLDB J. (2001)...
  • A.P. Sheth, J.A. Larson, Federated database systems for managing distributed, heterogeneous, and autonomous databases,...
  • B. Ding, J.X. Yu, S. Wang, L. Qin, X. Zhang, X. Lin, Finding top-k min-cost connected trees in databases, in: ICDE,...
  • G.J. Bex, F. Neven, T. Schwentick, K. Tuyls, Inference of concise DTDs from XML data, in: VLDB, 2006, pp....
  • G.J. Bex, F. Neven, S. Vansummeren, Inferring XML schema definitions from XML data, in: VLDB, 2007, pp....
  • H. Weinblatt, A new search algorithm for finding the simple cycles of a finite directed graph, JACM 19(1) (1972)...
  • B. Choi, G. Cong, W. Fan, S.D. Viglas, Updating recursive XML views of relations, in: ICDE, 2007, pp....
  • T.H. Cormen et al., Introduction to Algorithms, 2001
  • Wutka DTD parser...
  • Oracle Berkeley DB XML 2.3...
  • XMark XML data generator...
  • A. Schmidt, F. Waas, M. Kersten, M. Carey, I. Manolescu, R. Busse, XMark: a benchmark for XML data management, in:...
  • C. Delobel, M.-C. Rousset, A uniform approach for querying large tree-structured data through a mediated schema, in:...