View selection for designing the global data warehouse

https://doi.org/10.1016/S0169-023X(01)00041-6Get rights and content

Abstract

A global data warehouse (DW) integrates data from multiple distributed heterogeneous databases and other information sources. A global DW can be abstractly seen as a set of materialized views. The selection of views for materialization in a DW is an important decision in the design of a DW. Current commercial products do not provide tools for automatic DW design. We provide a general method that, given a set of select-project-join queries to be satisfied by the DW, generates sets of materialized views that satisfy all the input queries. This process is complex since `common subexpressions' between the queries need to be detected and exploited. Our method is then applied to solve the problem of selecting such a materialized view set that fits in the space allocated to the DW for materialization and minimizes the combined overall query evaluation and view maintenance cost. We design algorithms which are implemented and we report on their experimental evaluation.

Introduction

Data Warehousing is an approach for integrating data from multiple distributed heterogeneous databases and other information sources. In this approach data are extracted in advance from the sources and stored in a repository data warehouse (DW). When a query is issued against the DW, it is evaluated locally, without accessing the original information sources. DWs are used by companies for decision support applications. Therefore, ensuring high query performance is one of the most significant challenges when implementing a DW. In general, DWs are used for storing historical information. However, in this paper, we do not emphasize the historical aspect of a DW but rather we consider it as a repository of integrated data and we view it as a set of materialized views defined over source relations. When the source relations change, the materialized views need to be updated. Different maintenance policies (deferred or immediate) and maintenance strategies (incremental or rematerialization) can be applied [6]. This choice depends on the needs of the specific application for currency of the data and view maintenance performance.

A typical DW architecture [4] comprises three layers. The data at each layer is derived from the data of lower layers. At the lowest layer there are the distributed operational data sources. The central layer is the global or principal DW. The upper layer contains the local DWs or data marts. The global DW integrates the data from the distributed data sources. Data marts contain highly aggregated data for extensive analytical processing. They are also probably less frequently updated than global DWs. Here we deal with global DWs (local DW are tailored mainly for grouping/aggregation queries). In the following, DW refers to the global DW.

This paper deals with the selection of views for materialization in a DW. This is an important decision in the design of a DW. Current commercial products do not provide tools for automatic DW design. This issue can be formulated as a problem that takes as input a set of queries of interest defined over the source relations, to be satisfied by the DW, and produces a set of views to materialize that minimizes the overall query evaluation cost or view maintenance cost or a combination of both, under a number of resource constraints (for instance the space allocated for materialization in the DW).

The materialized view selection issue is complex. Its complexity is due to the fact that the input queries may contain subexpressions that are identical, equivalent or more generally subexpressions such that one can be computed from the other. We describe these subexpressions by the generic term common subexpressions. Common subexpressions in the input queries need to be detected and exploited in the view selection process, since they can significantly reduce the view maintenance cost and the space needed for materialization.

Most approaches dealing with the problem of selecting views to materialize in a DW avoid the issue of detecting common subexpressions between the input queries or views [7], [9], [13], [18], [28] by considering that the input to the view selection problem is not a set of query definitions but an AND/OR expression directed acyclic graph (dag) for multiple queries. AND/OR expression dags for multiple queries can represent common subexpressions to be taken into account in the materialized view selection process. Roussopoulos [19] provides initial work in the construction of such dags. Nevertheless, constructing these dags in a way that all `significant' common subexpressions are represented in them is an open problem.

Computing the input queries locally using exclusively the materialized views requires the existence of a complete rewriting of the queries over the materialized views [14]. Many formulations of the problem of selecting views to materialize in a DW [7], [9], [10], [28] do not explicitly impose this condition. Trivially, this requirement can be met by assuming that all the source relations necessary for answering the queries are available locally for computation. This can be achieved by: (a) considering a centralized DW environment, or (b) considering a distributed environment where all the source relations are replicated in the DW. Clearly, considering centralized DWs is a special instance of the problem since DWs are typically built over distributed data sources. Replicating the source relations entails an important waste of space and may not even be possible because of space restrictions. Further, in this case, the view maintenance cost is increased by the cost of applying to the replicated source relations every change performed to the source relations. These changes may not even affect the result of any query addressed to the DW.

In this paper we provide a general method that generates view selections from the input queries. The method detects and exploits common subexpressions among the input queries and guarantees the existence of a complete rewriting of the queries over the selected views. We then address the following problem (called DW design problem):

Given a set of queries to be satisfied by the DW, select a set of views to materialize in the DW such that:

  • 1.

    The materialized views fit in the space available at the DW.

  • 2.

    All the queries can be answered locally using this set of materialized views.

  • 3.

    The combination of the query evaluation cost and the view maintenance cost (operational cost) is minimal.

Based on the aforementioned method, we model the DW design problem as a state space search problem. We then design algorithms for solving it, and we report on their experimental evaluation.

The main contributions of this paper are the following:

  • We provide a set of transformation rules for select-project-join (SPJ) queries that, starting with the input query definitions, generate alternative view selections for materialization and a complete rewriting of the input queries over the selected views. We show that the proposed set of rules are sound and complete.

  • The transformation rules can be used to formulate different view selection problems as state space search problems.

  • Based on the transformation rule set we address the DW design problem. We show that the optimal solution can be found (if a solution to the problem exists) by defining equivalent states and exhaustively enumerating all equivalence classes.

  • Our approach is implemented for a class of SPJ-queries. The exhaustive algorithm can be very expensive for a large number of complex queries. Even though the design of a DW is not meant to be done frequently, we also design algorithms that significantly prune the search space using cost based and other heuristics.

  • The approach is general in that it does not assume that the source relations are available locally for computation (centralized environment) or that the source relations are replicated in the DW. Further, it does not depend on the way the query evaluation and view maintenance cost are assessed.

The rest of the paper is organized as follows: Section 2 presents related work. In Section 3, we formally state the DW design problem after providing some basic definitions. Section 4 introduces the transformation rules and shows their soundness and completeness. In Section 5, the DW design problem is modeled as a state space search problem. Section 6 discusses implementation issues and pruning algorithms. Experimental results are provided in Section 7, while Section 8 contains concluding remarks.

Section snippets

Related work

The view selection problem has been addressed by many authors in different contexts. Harinarayan et al. [10] provides algorithms for selecting a set of views to materialize that minimizes the query evaluation cost, under a space constraint, in the context of aggregations and multidimensional analysis. This work is extended in [8] where greedy algorithms are provided, in the same context, for selecting both views and indexes. In [7] greedy algorithms are provided for the view selection problem

Formal statement of the DW design problem

We consider that a finite, non-empty set Q of queries is given, defined over a set R of source relations. The DW contains a set V of materialized views over R such that every query in Q can be rewritten completely [14] over V. Thus, all the queries in Q can be answered locally at the DW, without accessing the source relations in R. Let Q be a query over R. By QV, we denote a complete rewriting of Q over V. By `complete' we mean that only views in V are involved in the rewriting of Q (hereafter,

Transformation rules

In order to model the DW design problem as a state space search problem we introduce a set of DW configuration transformation rules for a class of SPJ views and we show their soundness and completeness. These rules operate on a graph representation of multiple views.

Edge removal

The first transformation rule modifies a view by removing a selection or join edge from it in the multiquery graph. Attributes may be added to the attribute labels for that view on the nodes incident with the edge, in order to guarantee the rewriting of the initial view over the new one. This rewriting in turn guarantees the rewriting of the queries in QV over the new view set.
Transformation Rule 1: Let V be a view in GV, and e be an edge between nodes Ri and Rj labeled as V:p. Nodes Ri and Rj

Attribute removal

The second transformation rule modifies a view by eliminating attributes from attribute labels for this view in the multiquery graph. This rule can be used when in each tuple of the view and in any view materialization, the values for some attributes equal the values for other attributes.
Transformation Rule 2: Let V be a view in GV, X be the set of all the attributes appearing in attribute labels for V, and A1,…,Ak, B1,…,Bk be attributes in X. If the atoms A1=B1,…,Ak=Bk, where AiAj, ij, and Ai

View break

The third transformation rule creates two new views by splitting a view in the multiquery graph. The `old' view can be rewritten, using exclusively the new views.
Transformation Rule 3: Let V be a view, and N1,…,Nk, k⩾2, be sets of nodes labeled by V, in GV, such that:

  • (a) NiNj, ij, i,j∈[1,k].

  • (b) ⋃i∈[1,k]Ni is the set of all the nodes labeled by V in GV.

  • (c) There is no edge labeled by V in GV, between a node in Ni−⋃j∈[1,k],jiNj and a node in ⋃j∈[1,k],jiNjNi, i=1,…,k.


Then,
  • 1.

    GV

View merging

The fourth transformation rule creates one new view by merging two views in the multiquery graph. Each of the two `old' views can be rewritten using exclusively the new view.
Transformation Rule 4: Let V1 and V2 be two views in GV, and N1 and N2 be their sets of nodes, respectively such that:

  • (a) There is no edge labeled by V1 (V2) between a node in N1N2 (N2N1) and a node in N1N2.

  • (b) For every atom p of V1 (V2) labeling an edge between nodes in N1 (N2), p implies or is implied by an atom of V2

Attribute transfer

The fifth transformation rule eliminates some attributes from attribute labels for a view V in the multiquery graph and produces a new one-tuple view where these attributes are the only projected attributes. This rule is introduced to account for the case where, for every instance of the source relations, all the tuples of V have the same value for each of these attributes.
Transformation Rule 5 Let V be a view in GV, X be the set of all the attributes appearing in attribute labels for V, and {A1

The DW design problem as a state space search problem

We now model the DW design problem as a state space search problem. We define states and transitions between states.

A state s is a DW configuration C=〈GV,QV. A particular state, the state GQ,QQ, is called initial state and is denoted s0. With every state s a cost is associated through the function cost(s). This is the operational cost T(C) of the DW configuration C. Also, a size is associated with a state s through the function size(s). This is the space S(V) needed for materializing the

Implementation and algorithms

Our approach has been implemented for a class of SPJ-queries. For simplicity, queries and views are as defined in Section 4 but do not contain `pure' Cartesian products. Query rewritings over the materialized views are not allowed to contain Cartesian products and self-joins. The transformation rules are slightly modified to account for this class of queries, views and query rewritings. Attribute removal and attribute transfer rules are invalidated. We have implemented a simple cost model for

Experimental results

We have performed a sequence of experiments to compare the algorithms presented in Section 6. The comparison is first made with respect to the time required to find the solution. The time required by each algorithm is expressed as CPU time. We also study the greedy algorithm with respect to the quality of the solution returned. This is expressed as the percentage of the cost of the optimal state returned by the pruning algorithm divided by the cost of the final state that the greedy algorithm

Conclusion and future work

In this paper we have dealt with the issue of selecting a set of views to materialize in a DW. This decision is important when implementing a DW. Current products do not provide tools for automatic DW design.

We provide a general method that, given a set of SPJ queries to be answered by the DW, generates materialized view selections that answer all the input queries. This is done through the use of a set of transformation rules which detect and exploit common subexpressions between the queries.

Dimitri Theodoratos received a Diploma in Electrical Engineering from the National Technical University of Athens in 1985, a master's degree in Computer Science from the Ecole Nationale Superieure de Telecommunication of Paris, in 1986, and a Ph.D. degree in Computer Science from the University of Paris at Orsay in 1991. From 1993 to 1995 he was a European Union post-doc fellow at Rutherford Appleton Laboratory (RAL) in UK, and at Institut Nationale de Recherche en Informatique et Automatique

References (28)

  • M. Mohania et al.

    Making aggregate views self-maintainable

    Data and Knowledge Engineering

    (2000)
  • D. Theodoratos et al.

    Designing data warehouses

    Data and Knowledge Engineering

    (1999)
  • M.O. Akinde et al.

    Minimizing detail data in data warehouses

  • E. Baralis et al.

    Materialized view selection in a multidimensional database

  • J.A. Blakeley et al.

    Updating derived relations: detecting irrelevant and autonomously computable updates

    ACM Transactions on Database Systems

    (1989)
  • S. Chaudhuri et al.

    An overview of data warehousing and OLAP technology

    SIGMOD Record

    (1997)
  • A. Gupta et al.

    Data integration using self-maintainable views

  • A. Gupta et al.

    Maintenance of materialized views: problems, techniques and applications

    Data Engineering

    (1995)
  • H. Gupta

    Selection of views to materialize in a data warehouse

  • H. Gupta et al.

    Index selection for OLAP

  • H. Gupta et al.

    Selection of views to materialize under a maintenance cost constraint

  • V. Harinarayan et al.

    Implementing data cubes efficiently

  • R. Hull et al.

    A framework for supporting data integration using the materialized and virtual approaches

  • Y. Kotidis et al.

    DynaMat: a dynamic view management system for data warehouses

  • Cited by (46)

    • MR-MVPP: A map-reduce-based approach for creating MVPP in data warehouses for big data applications

      2021, Information Sciences
      Citation Excerpt :

      The query rewriting is conducted such that the queries have more shared sub-expressions and can use the results of each other. For example, some methods based on the rewriting of the queries were presented in [16,58–60]. Syntactical analysis of the workload is another technique used in the MVS methods [8].

    • On solving efficiently the view selection problem under bag and bag-set semantics

      2014, Information Systems
      Citation Excerpt :

      The view selection problem has received significant attention in many data-management scenarios, such as information integration [25], data warehousing [31,30], Web-site design [16], query optimization [29], and recently, database-as-a-service model [21] and data-placement in cloud systems [35].

    • Simultaneous determination of view selection and update policy with stochastic query and response time constraints

      2008, Information Sciences
      Citation Excerpt :

      Lastly, the query arrivals can be assumed as with constant rates, or as stochastic processes with variations which can be modeled by some probability distributions such as Poisson distribution. Most published results resolve the issues on systems without query-triggered updates and with fixed query arrival rates [5–11,15,21,22,24,31–35]. Only limited research has considered the query-triggered enabling systems with stochastic process [27].

    • A general model for the design of data warehouses

      2008, International Journal of Production Economics
      Citation Excerpt :

      Incompatibilities will be solved through a dialogue between the user and the system. The schema being strengthened then definitively, it is possible to use approaches by materialized views (as in Theodoratos et al. (2001) to make the connection between the sources and the repository and to solve the problems of data refreshment. We think that such facilities would permit an end user to handle himself the design and the manipulation of his data warehouse.

    • An improved framework for modelling data warehouse systems using uml profile

      2020, International Arab Journal of Information Technology
    View all citing articles on Scopus

    Dimitri Theodoratos received a Diploma in Electrical Engineering from the National Technical University of Athens in 1985, a master's degree in Computer Science from the Ecole Nationale Superieure de Telecommunication of Paris, in 1986, and a Ph.D. degree in Computer Science from the University of Paris at Orsay in 1991. From 1993 to 1995 he was a European Union post-doc fellow at Rutherford Appleton Laboratory (RAL) in UK, and at Institut Nationale de Recherche en Informatique et Automatique (INRIA) in Paris. Since, 1996 he is a research associate at the Knowledge and Data Base Systems Laboratory in the National Technical University of Athens. From 1996 to 2001 he taught different courses including Data Bases, Knowledge bases and Artificial Intelligence as an assistant professor at the Computer Science Department of the University of Ioannina and at the Information and Communication Systems Department of the University of Aegean in Greece. His current research interests include Data Bases, Data Warehousing and On-Line Analytical Processing. Dimitri Theodoratos has published several articles in refereed journals and international conferences. He has also served as a reviewer in journals and as a PC member in international conferences.

    Spyros Ligoudistianos received a B.Sc. degree from the Computer Science Department of the University of Crete, Greece, in 1993, and a Ph.D. degree from the National Technical University of Athens, Greece, in 1999. The title of his thesis was “Design and Operational Issues of Data Warehouse Systems”. He has published in international and national conferences and journals. Currently he is a researcher at the Knowledge and Database System Laboratory in the Electrical and Computer Engineering Department of the National Technical University of Athens. His research interests include Enterprise Data Warehouses, OLAP systems, and E-Business.

    Timos Sellis received his B.Sc. degree in Electrical Engineering in 1982 from the National Technical University of Athens, Athens, Greece. In 1983 he received the M.Sc. degree from Harvard University and in 1986 the Ph.D. degree from the University of California at Berkeley, where he was a member of the INGRES group, both in Computer Science. In 1986, he joined the Department of Computer Science of the University of Maryland, College Park as an Assistant Professor, and became an Associate Professor in 1992. Between 1992 and 1996 he was an Associate Professor at the Computer Science Division of the National Technical University of Athens (NTUA), in Athens, Greece, where he is currently a Full Professor. Timos Sellis is also the head of the Knowledge and Database Systems Laboratory at NTUA. His research interests include extended relational database systems, data warehouses, and spatial, image and multimedia database systems. He has published over 100 articles in refereed journals and international conferences in the above areas. Timos Sellis is a recipient of a Presidential Young Investigator (PYI) award for 1990–1995, and of the VLDB 1997 10 Year Paper Award together with N. Roussopoulos and C. Faloutsos. He is a member of the Editorial Boards of the International Journal on Intelligent Information Systems: Integrating Artificial Intelligence and Database Technologies, and Geoinformatica.

    Research supported by the European Commission under the ESPRIT program LTR project “DWQ: Foundations of Data Warehouse Quality” and by the General Secretariat of Research and Technology (Greece) under the Greece–France cooperation program PLATO.

    View full text