Adaptive relaxation for querying heterogeneous XML data sources

doi:10.1016/j.is.2010.02.002

Information Systems

Volume 35, Issue 6, September 2010, Pages 688-707

Abstract

Searching XML data with a structured XML query can improve the precision of results compared with a keyword search. However, the structural heterogeneity of the large number of XML data sources makes it difficult to answer a structured query exactly, so query relaxation is necessary. Previous work on XML query relaxation suffers from the unnecessary computation of a large number of unqualified relaxed queries. To address this issue, we propose an adaptive relaxation approach that relaxes a query differently against different data sources, based on their conformed schemas. In this paper, we present a set of techniques that support this approach: schema-aware relaxation rules for relaxing a query adaptively, a weighted model for ranking relaxed queries, and algorithms for adaptive relaxation of a query and top-k query processing. We discuss results from a comprehensive set of experiments that show the effectiveness and efficiency of our approach.

Introduction

As XML becomes the standard for representing web data, people are publishing a large volume of data on the internet using XML for various purposes. For example, universities publish their course and research data to attract students; travel and real estate agents publish their flight and property data for advertising; stock brokers and car dealers publish stock and car information for online business; public service providers publish data such as tourist attractions and publications to provide information. As such, there is an increasing need to search and query XML data. Compared with a keyword search, a structured XML query allows a user to formulate search requests more precisely. However, the structural heterogeneity of the potentially large number of XML data sources makes it difficult to answer a structured query exactly. The loosely coupled nature of the data sources also rules out the traditional federated database approach, which integrates the XML data sources by defining a global schema. Ideally, a query would be relaxed intelligently and then answered according to the data sources against which it is issued.

Amer-Yahia et al. [1], [2] proposed FleXPath, a framework for relaxing XML tree pattern queries (TPQs). Given a TPQ q, the closure of the structural and value-based predicates in q is first inferred and then used to generate relaxed queries. The set of generated queries, including the one that includes the root of q, contains all possible relaxed queries. However, the relaxation process is essentially blind and wild, and the number of relaxed queries can be large. For a large number of heterogeneous XML data sources, many of the generated relaxed queries could be unqualified and would incur the unnecessary cost of either computing or testing them.

As an example, we may issue a query against the XML data sources maintained by all Australian universities to search for those departments that have a group running a project with a name containing “xml” and having publications with a title containing “query relaxation”. As the number of universities is large and their data source structures may vary, users normally formulate their queries against a domain schema that reflects the common understanding of a university. Here a domain schema bears similarity to a global schema. However, unlike in a federated database, such a domain schema and its mapping from data source schemas may not be physically defined, so we cannot borrow the global-to-local query rewriting techniques used in data integration [4], [5]. Fig. 1 shows the query q represented as a TPQ; it reflects the user's structural and value-based search requirements. Based solely on the query itself, FleXPath may need to consider 2⁵ options, each of which could be a relaxed query that is executed or tested against the university data sources. Some generated relaxed queries may be either too blind for some data sources, and thus return zero answers, or too wild, and thus return answers that are far from what a user expects. For example, the partial structures of the data sources s₁ and s₂ for two universities are shown in schema d₁ in Fig. 2 and schema d₂ in Fig. 3, respectively. Obviously, the query itself will not return any result, and many relaxed queries will be generated by FleXPath from the 2⁵ options and then be evaluated or tested for both data sources. For example, some of the useless relaxed queries for s₂ are listed in Fig. 4. In fact, q₂ and q₃ in Fig. 4 are also useless for s₁.

To deal with this problem, we propose an adaptive query relaxation (AQR) approach, which relaxes a query adaptively to each XML data source according to its conformed schema. Hence each relaxed query is guaranteed to agree with the structural constraints imposed by the conformed schema of the data source and, as a result, has a higher probability of generating answers compared with FleXPath. For example, for schema d₁ in Fig. 2 and schema d₂ in Fig. 3, the relaxed queries generated by AQR are shown in Fig. 5 and Fig. 6, respectively.

AQR avoids blind relaxation. Each relaxed query generated for an XML data source is specific to that data source. In other words, a relaxed query that does not satisfy the structural constraints imposed by the conformed schema will not be generated. This is similar to semantic query optimisation, where a query that contradicts an integrity constraint defined in the underlying schema need not be evaluated. For example, for data source s₂, query q₁ in Fig. 4 is useless and will not return any result because the edge between group and project in q₁ does not match d₂.
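The idea of discarding a candidate whose structure cannot match the schema can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the edge-set representation, the function name `violates_schema`, and the toy schema `toy_d2` are all assumptions made for illustration, since the actual schemas in Figs. 2 and 3 are not reproduced here.

```python
from typing import Iterable, Set, Tuple

Edge = Tuple[str, str]  # (parent element name, child element name)


def violates_schema(query_edges: Iterable[Edge], schema_edges: Set[Edge]) -> bool:
    """Return True if some parent-child edge of the query is absent from the schema."""
    return any(edge not in schema_edges for edge in query_edges)


# Hypothetical schema fragment standing in for d2 (NOT the schema of Fig. 3):
# project hangs directly under department, so there is no group -> project edge.
toy_d2 = {
    ("department", "group"),
    ("department", "project"),
    ("project", "pname"),
}

# Hypothetical candidate that keeps the group -> project edge (in the spirit of q1).
candidate_blind = [("department", "group"), ("group", "project")]
# Hypothetical candidate whose edges all exist in the toy schema.
candidate_ok = [("department", "project"), ("project", "pname")]

print(violates_schema(candidate_blind, toy_d2))  # True: never worth evaluating
print(violates_schema(candidate_ok, toy_d2))     # False: consistent with the schema
```

A check of this kind only covers parent-child edges; the relaxation rules of Section 3 also use cardinality and disjunction information from the DTD, which this sketch ignores.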

AQR also avoids wild relaxation. No unnecessary relaxation is performed, because a data source has to conform to its schema. For example, the *-node project in d₂ implies the co-existence of project and pname. As such, for s₂, query q₂ in Fig. 4 is too wild compared with query q₂₁ in Fig. 6. In other words, after q₂₁ is generated and evaluated, the time spent on generating and computing q₂ is unnecessary because no new result will be returned.

As a large number of data sources may be evaluated and the relaxed queries generated for these data sources may differ, it is desirable to first execute the relaxed query that is closest to the original query, so that the most relevant results are returned first. This is especially important for evaluating a top-k query [3], [6]. For example, query q₁₁ in Fig. 5 is the closest to the original query q among all the generated relaxed queries in Figs. 5 and 6. In AQR, a top-k query is processed by incrementally evaluating the relaxed queries in their ranking order. Instead of ranking returned results, we rank the relaxed queries. To measure how close a relaxed query is to the original query, we propose a penalty-based ranking model that quantifies the difference between the two. For example, if the penalty for relaxing “/” to “//” is 0.1, we can compute the penalties of q₁₁ and q₁₂ in Fig. 5 as 2 and 5, and the penalties of q₂₁ and q₂₂ in Fig. 6 as 2.3 and 5.3. So q₁₁ is the least penalized query, that is, the closest query to the original query q. The details for computing the penalties can be found in Section 4.
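Incremental evaluation in ranking order can be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the function `top_k`, the callback `fake_evaluate`, and the answer sets in `fake_answers` are hypothetical, and only the four penalty values are taken from the example above. The actual scheduling strategy is the subject of Section 6.

```python
from typing import Callable, Iterable, List, Tuple

RankedQuery = Tuple[float, str, str]  # (penalty, query id, data source id)


def top_k(ranked: Iterable[RankedQuery],
          evaluate: Callable[[str, str], List[str]],
          k: int) -> List[str]:
    """Evaluate relaxed queries from least to most penalized until k answers are found."""
    results: List[str] = []
    for _penalty, query_id, source_id in sorted(ranked):
        if len(results) >= k:
            break  # enough answers: more heavily relaxed queries are never run
        for answer in evaluate(query_id, source_id):
            if answer not in results:
                results.append(answer)
            if len(results) == k:
                break
    return results


# Penalties from the running example; the source assignment follows Figs. 5 and 6.
ranked = [(2.0, "q11", "s1"), (5.0, "q12", "s1"), (2.3, "q21", "s2"), (5.3, "q22", "s2")]

# Hypothetical answers, invented purely to make the sketch runnable.
fake_answers = {"q11": ["a1"], "q12": ["a1", "a2"], "q21": ["b1", "b2"], "q22": ["b1", "b3"]}
evaluated: List[str] = []


def fake_evaluate(query_id: str, source_id: str) -> List[str]:
    evaluated.append(query_id)  # record which queries were actually run
    return fake_answers[query_id]


answers = top_k(ranked, fake_evaluate, k=3)
print(answers)    # ['a1', 'b1', 'b2']
print(evaluated)  # ['q11', 'q21']
```

In this toy run the two most heavily penalized queries are never evaluated, which is the saving that ranking queries rather than results is meant to provide.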

To improve the accuracy and relevance of the results, we allow a user to specify weights on the edges of a query and thus incorporate weights into the ranking model. If the relationship between two nodes is less important than others, a weight smaller than the maximum weight 1 may be specified. Our ranking model is based on the weights set on the original query and the penalty derived for a relaxed query. The ranking score of a relaxed query is calculated as the difference between the query weight of the original query and the penalty of the relaxed query. For example, when the weights for all edges are set to 1, the weight of the original query is 11. We know that the penalties for q₁₁, q₁₂, q₂₁ and q₂₂ are 2, 5, 2.3 and 5.3, so their scores are 9, 6, 8.7 and 5.7, respectively, i.e., the ranking list is [q₁₁, q₂₁, q₁₂, q₂₂]. If a user thinks that the relationship of a project under a group is less important and wishes to decrease the weight of the edge between group and project in q to 0.5 while keeping the other edges at 1, the ranking list changes to [q₂₁, q₁₁, q₂₂, q₁₂]. The details for computing these ranking scores can also be found in Section 4.
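The arithmetic of the all-weights-equal-to-1 example can be checked with a few lines of Python. The helper name `ranking_score` is an assumption for illustration; the query weight 11 and the four penalties are the values quoted in the paragraph above. The reweighted case is not reproduced, because it depends on the penalty computation of Section 4.

```python
def ranking_score(query_weight: float, penalty: float) -> float:
    """Score of a relaxed query: weight of the original query minus the penalty."""
    return round(query_weight - penalty, 1)  # rounding only hides float noise


original_weight = 11.0  # all edge weights set to 1, as in the text
penalties = {"q11": 2.0, "q12": 5.0, "q21": 2.3, "q22": 5.3}

scores = {name: ranking_score(original_weight, p) for name, p in penalties.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

print(scores)   # {'q11': 9.0, 'q12': 6.0, 'q21': 8.7, 'q22': 5.7}
print(ranking)  # ['q11', 'q21', 'q12', 'q22']
```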

In case the schema is not available for a data source, its structural information can be generated dynamically with data summarization tools [7], [8]. In this paper, without loss of generality, we take DTD as the schema of XML data.

In summary, we claim the following contributions in this paper:

• We propose and formalize the adaptive XML query relaxation problem w.r.t. different DTDs and devise a set of schema-aware relaxation rules.
• We develop a weight modification and penalty evaluation model to assess to what extent the original query is relaxed.
• We design a set of algorithms to describe how the rules and penalty model are leveraged in the process of relaxing queries.
• We provide a scheduling strategy to incrementally evaluate relaxed queries across multiple data sources.
• We run extensive experiments on XMark Benchmark to justify the efficiency and validity of our adaptive relaxation approach.

The rest of the paper is organized as follows. We give an overview of AQR in Section 2. Section 3 discusses the relaxation rules in detail. Section 4 provides our weight modification and penalty evaluation models. The detailed descriptions of our adaptive relaxation algorithm are provided in Section 5. Section 6 proposes strategies to schedule the relaxed queries during query evaluation. We present the results of extensive experiments in Section 7. A brief survey of related work and the conclusions of this work are given in Sections 8 and 9, respectively.

Section snippets

Overview

The goal of query relaxation is to relax the query constraints so that approximate answers can be returned if the original query returns no answers or not enough answers. This is especially useful when we query a large number of heterogeneous XML data sources using a single structured XML query. Given a TPQ q, FleXPath generates relaxed queries by enumerating all possible combinations starting from q itself to the root of q, thus resulting in a large number of relaxed queries. Among these queries,

Adaptive relaxation rules

As discussed above, AQR relaxes a query for an XML data source based on its conformed DTD. Basically, AQR avoids blind relaxation by filtering out those query nodes that do not appear in the DTD and adjusting the node relationships if they do not match the DTD; AQR also avoids wild relaxation by preserving the query requirements which are definitely satisfied by the DTD. Before we introduce the set of adaptive relaxation rules for these purposes, we need the following definitions.

Definition 3 Corresponding node

Let a WTPQ q =

Weight and penalty

In order to improve the precision of a user-specified query, we allow users to assign weights to edges in the query to express their preferences for different paths. The default weight of 1 is used if users do not state any preference. The weight information serves as the foundation for associating each relaxed query with a reasonable penalty. A less modified query with a low penalty is expected to capture the user's original intent more accurately.

Adaptive relaxation process

Given a WTPQ q and a certain DTD d_i, we can relax q and generate a set of relaxed queries Q_i using the relaxation rules discussed in Section 3. To get Q_i, we can first relax q and generate a relaxed query q_i that preserves maximum query requirements of q w.r.t. d_i. In other words, q_i receives the minimal penalty w.r.t. d_i. Based on q_i, we can then generate other relaxed queries in Q_i according to the cardinality and disjunctive information provided in d_i. In this section, we introduce the

Top-k query evaluation

Given a WTPQ q and a set of heterogeneous data sources s₁, s₂, …, s_n conforming to a set of DTDs d₁, d₂, …, d_n, respectively, we can use the relaxation algorithm introduced in Section 5 to generate a set of relaxed queries q₁, q₂, …, q_n together with their query weights w_e(q₁), w_e(q₂), …, w_e(q_n). We denote these query weights as score(q₁), score(q₂), …, score(q_n) in this section. Each q_i (1 ≤ i ≤ n) is the relaxed query that preserves the maximum query requirements of q w.r.t. d_i and serves as a

Experiments

We ran the experiments on an Intel P4 3 GHz PC with 512 MB of memory. Wutka DTDparser [12] was used to analyze the source DTDs and extract their structural information. All relaxed queries were evaluated as XPath patterns in Oracle Berkeley DB XML [13].
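To make concrete what evaluating a relaxed query as an XPath pattern means, the following Python sketch contrasts a child-axis step ("/") with its relaxed descendant-axis form ("//"). It uses the standard library module xml.etree.ElementTree purely as a stand-in, not Oracle Berkeley DB XML, and the toy document (including the `team` element) is an assumption invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: project is nested one level deeper than the query expects.
doc = ET.fromstring(
    "<department>"
    "<group><team><project><pname>xml search</pname></project></team></group>"
    "</department>"
)

# Original pattern: project must be a direct child of group.
strict = doc.findall("./group/project")

# Relaxed pattern: the "/" between group and project is relaxed to "//".
relaxed = doc.findall("./group//project")

print(len(strict))   # 0
print(len(relaxed))  # 1
print(relaxed[0].find("pname").text)  # xml search
```

ElementTree implements only a subset of XPath, so value-based predicates such as "name contains xml" would need extra filtering code or a fuller XPath engine.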

Dataset and queries: We used the XMark XML data generator [14] to create a set of XML documents of different sizes, from 5 to 40 MB, which conform to auction.dtd [15]. These XML documents can be used to test the efficiency of AQR. In order to compare the

Related work

Query relaxations on structure have been studied recently. Some approaches propose to relax queries that return no result. Delobel and Rousset [16] define three kinds of relaxations: unfolding a node (replicating a node by creating a separate path to one of its children), deleting a condition at a node, and propagating a condition to its parent node. Schlieder [17] considers relaxations on an XQL query: deleting nodes for making the context loose, inserting a node between inner nodes for

Conclusions

In this paper, we presented a novel query relaxation approach, AQR, that adaptively relaxes a query to different XML data sources based on their conformed DTDs and users’ intentions. A set of schema-aware relaxation rules was designed, and a pertinent penalty model based on weight modification was developed. Adaptive relaxation algorithms and strategies for top-k query evaluation were implemented and illustrated through a comprehensive set of experiments to show the effectiveness and efficiency

Acknowledgements

We would like to thank anonymous reviewers for their helpful comments on this article. This work was supported partly by the Australian Research Council Discovery Project under the Grant no. DP0878405 and the Research Grant Council of the Hong Kong SAR, China under the Grant no. 419109.

References (32)

[1] S. Amer-Yahia, S. Cho, D. Srivastava, Tree pattern relaxation, in: EDBT, 2002, pp....
[2] S. Amer-Yahia, L.V.S. Lakshmanan, S. Pandit, FleXPath: flexible structure and full-text querying for XML, in: SIGMOD,...
[3] A. Marian, S. Amer-Yahia, N. Koudas, D. Srivastava, Adaptive processing of top-K queries in XML, in: ICDE, 2005, pp....
[4] A.Y. Halevy, Answering queries using views: a survey, VLDB J. (2001)...
[5] A.P. Sheth, J.A. Larson, Federated database systems for managing distributed, heterogeneous, and autonomous databases,...
[6] B. Ding, J.X. Yu, S. Wang, L. Qin, X. Zhang, X. Lin, Finding top-k min-cost connected trees in databases, in: ICDE,...
[7] G.J. Bex, F. Neven, T. Schwentick, K. Tuyls, Inference of concise DTDs from XML data, in: VLDB, 2006, pp....
[8] G.J. Bex, F. Neven, S. Vansummeren, Inferring XML schema definitions from XML data, in: VLDB, 2007, pp....
[9] H. Weinblatt, A new search algorithm for finding the simple cycles of a finite directed graph, JACM 19(1) (1972)...
[10] B. Choi, G. Cong, W. Fan, S.D. Viglas, Updating recursive XML views of relations, in: ICDE, 2007, pp....
[11] T.H. Cormen et al., Introduction to Algorithms, 2001.
[12] Wutka DTD parser...
[13] Oracle Berkeley DB XML 2.3...
[14] XMark XML data generator...
[15] A. Schmidt, F. Waas, M. Kersten, M. Carey, I. Manolescu, R. Busse, XMark: a benchmark for XML data management, in:...
[16] C. Delobel, M.-C. Rousset, A uniform approach for querying large tree-structured data through a mediated schema, in:...
