Identifying relevant constraints for semantic query optimization

https://doi.org/10.1016/S0950-5849(00)00121-XGet rights and content

Abstract

Semantic query optimization is the process of utilizing information implied by integrity constraints to reformulate the query into one that generates the same set of answers in a more efficient way. The difficulties of identifying relevant integrity constraints for a given query have been well recognized as have the various solutions. However, most of the previous works consider the query consisting of join(s) of base relations and the integrity constraints on base relations only. We generalize these restrictions and propose a method of identifying relevant integrity constraints for queries involving any combinations of joins and unions of base and defined relations. Our method utilizes a query graph that can be constructed dynamically during the query processing time, and, as a consequence, does not rely on heavy preprocessing or normalization. The method is extended to include the use of heuristics for generating a subset of answers.

Introduction

A database is required to be maintained in a valid state in which data values are restricted by certain constraints, namely, integrity constraints. These constraints, often referred to as semantic knowledge, can be functional dependencies, range restrictions, or even complicated rules that define relationships between attributes of relations. Such information can be used in enhancing the efficiency of query processing through a process known as Semantic Query Optimization (SQO). Formally, let Cl,…,Cn be integrity constraints (ICs). If the database is in a valid state, then a query Q is logically equivalent to QCl∧…∧Cn [10]. By applying transformations that preserve equivalence, different forms of the query can be generated. Among these we hope to choose the least expensive one to process.

Deductive database is a very convenient paradigm in which to analyze the interactions between the queries and ICs. The queries, relations, and ICs can all be represented in a single formalism, thus allowing full interactions. Using a relation, Teaches(Professor, Dept, Course, Lab), the query, “find professors who teach courses with programming lab”, and the integrity constraint, “only professors in the computer science department can teach courses with programming lab”, can be represented by Qa and IC1, respectively.

  • Qa: Teaches(P, _,_,program) / P is a free variable, _ is a don't care condition/

  • IC1: DI=‘cs’←Teaches(P1, D1, C1, ‘program) / variables are universally quantified/

Then, in a valid database state, Qa is equivalent to
  • Teaches(P. _,_, ‘program’)∧∀(D1, P1, C1) (D1=‘cs’Teaches(P1, D1, C1, ‘program’))

Applying resolution [15], we can obtain another form of the query where a restriction D=‘cs’ has been added.
  • Teaches(P, _,_,program)D=cs

Consider another IC, IC2, shown below, stating that a student and his/her advisor must be in the same department. IC2 is not relevant to the above query since evaluation of Qa does not involve any of faculty or student.
  • IC2: Fdept=SdeptStudent(S#, Sname, Sdept, Advisor), Faculty(F#, Fname, Fdept),

    • Advisor=F#

IC2 is also not relevant to the following query, Qb, asking for F#s of faculty members who are either in the computer science (cs) department or advising cs students. IC2 is relevant only to queries involving the join of the relations Student and Faculty. In Qb, the two relations are involved in a union operation, thus making IC2 irrelevant to Qb.
  • Qb: Faculty(F#, _, ‘cs’)Student(_,_, ‘cs’, F#)

As these examples suggest, the task of selecting from the set of ICs the ones relevant for the given query is not trivial. In a large database, it is expected that the number of ICs is large enough to make the running of a theorem prover not practical at all. The problem becomes even more complicated when views, or intensional database (IDB) relations, are involved in the query. Most of the previous works in SQO [2], [8], [10], [12], [18] have considered only join queries with no unions. IDB relations tend to introduce unions. In Ref. [2], when a query involves an IDB relation, it is compiled into a set of join queries on extensional database (EDB) relations and each partial query is processed individually. It is easy to see that processing individual partial queries may cause some redundancy and, moreover, the number of such partial queries will grow exponentially, making the query reformulation process even more complicated.

In this paper, we introduce a method of efficiently identifying relevant ICs for a given query at query processing time. The query may have any number of joins and unions of relations. We utilize a query graph in order to identify a set of relevant ICs for a given query. The head of a relevant IC can be used as a valid condition on the original query. Our objective is to collect, as many as possible and as efficiently as possible, such valid conditions from the set of ICs. The stage of actually reformulating the query using this information from the relevant ICs is beyond the scope of this paper (for related work, see [10], [18]).

Another type of knowledge that can be utilized in query processing is heuristics. Heuristics, when represented in Horn clauses, are similar in structure to integrity constraints. On the other hand, they differ from integrity constraints in that they represent guesses, hunches, or properties that are not verified. So, in query processing, use of such types of knowledge may result in an incomplete set of answers. Retrieving only a subset of answers from the database is often referred to as few answer retrieval (FAR). We propose vertical and horizontal partitioning techniques to support FAR. We also show how backtracking can be used to guarantee completeness in case the full set of answers for the original query is requested.

The remainder of this paper is organized as follows. In Section 2, we state our problem and define our scope. In Section 3, we introduce a way of selecting relevant ICs for a query involving joins and unions of only EDB relations. We generalize this method to queries involving IDB relations in Section 4. Issues on management and application of information obtained from relevant ICs are discussed in Section 5. We extend our discussions to heuristics in Section 6. Simulation results are presented in Section 7, and conclusions in Section 8.

Section snippets

The database

We assume the reader is familiar with terms and concepts in database query processing [9] and deductive databases [6]. The database consists of three major components; the EDB, IDB and the set of semantic knowledge clauses. We shall use the term database predicate to denote a predicate that represents a database relation, EDB or IDB. Other predicates (=, <, >, etc.) will be collectively called built-in predicates. Because complex objects will complicate unification and operations in the EDB

Queries with joins/unions of EDB relations

Let us assume that our query has a combination of joins and unions on EDB relations. If an IC contains only one database predicate in its body, it is relevant to the given query only if the predicate is in the query cluster. This is rather trivial so we will not consider ICs with only one database predicate in the body (such as I1 in Example 3). Furthermore, to simplify our discussions, we will consider only ICs with exactly two database predicates in the body. Generalization to ICs with more

Queries with IDB relations

The presence of IDB relations in a query complicates the task of selecting relevant ICs in two ways. First, it generates extensions which are logically equivalent to the query. Thus, an IC that may not seem relevant to the query itself may prove to be relevant to an extension of the query. This means the PRC selection process must consider not only the query expression itself but also its extensions, as shown in Example 1. Second, an IDB relation defined by more than one IDB rules introduces

Scope of a restriction

When the body of a remainder is satisfied, the head becomes a valid restriction. These restrictions are used in satisfying still other PRCs. These relationships are depicted in Fig. 5. The primary objective of the semantic analysis stage is to maximize the set of restrictions SR. A collected restriction can be applied only to the relevant portion of the query tree. We call this the scope of a restriction. We present in this section, how this scope can be practically managed.

Extension to few answer retrieval

An artificial intelligence application generally deals with a problem that seeks the best answer or few answers as opposed to seeking all answers, which is typical to a database application. A subset of possible answers should be found when not all answers are required and it is more efficient to do so, or only answers that satisfy certain qualifications (good, high, large, etc.) are wanted. This concept, which we refer to as few answer retrieval (FAR), has been actively pursued, although under

Simulations

A set of simulations have been run in order to demonstrate the performance characteristics of our proposal. The output measurement, which we simply call runtime (measured in milliseconds), is the average time it takes to identify the relevant constraints and collects restrictions for a query. The database contains 20 EDB predicates (base tables), 100 IDB rules, and 500 knowledge clauses (ICs and heuristics). The average number of database predicates in an IDB rule is 3 and that of knowledge

Conclusions

We have presented a way of identifying relevant knowledge clauses for a given query that may involve joins and unions on EDB and IDB relations. Our only assumption on the structure of the query is that the order of actual operations faithfully follows the structure of the given query (or equivalently, query tree). In other words, our approach is applicable to any structure of an AND/OR query without alteration. An alternative way of collecting relevant ICs for a join/union query would be to

References (22)

  • M.L. Barja et al.

    Design and implementation of rock and roll: a deductive object-oriented database system

    Information Systems

    (1995)
  • U.S. Chakravarthy et al.

    Foundations of semantic query optimization for deductive databases

  • C.L. Chang et al.

    Symbolic Logic and Mechanical Theorem Proving

    (1973)
  • W.F. Clocksin et al.

    Programming in Prolog

    (1984)
  • B. Freitag, O. Biernath, An Airtravel Expert Database, Proceedings of Third IEEE, International Conference on Data and...
  • H. Gallaire et al.

    Logic and databases: a deductive approach

    ACM Computing Surveys

    (1984)
  • J. Han, L.J. Henschen, W. Lu, Search strategies for finding partial answers in large knowledge-bases, Proceedings of...
  • M.M. Hammer, S.B. Zdonik, Knowledge based query processing. Proceedings of Sixth International Conference on VLDB, VLDB...
  • M. Jarke et al.

    Query optimization in database systems

    ACM Computing Surveys

    (1984)
  • J.J. King, Query Optimization by Semantic Reasoning, PhD Thesis, Department of CS, Stanford University, Stanford,...
  • R. Kung, E. Hanson, Y. Ioannidis, T. Sellis, L. Shapiro, M. Stonebraker, Heuristic Search in Data Base Systems,...
  • Cited by (0)

    View full text