Elsevier

Information Systems

Volume 32, Issue 2, April 2007, Pages 248-282
Information Systems

Partitioned optimization of complex queries

https://doi.org/10.1016/j.is.2005.09.003Get rights and content

Abstract

Performing complex analysis on top of massive data stores is essential to most modern enterprises and organizations and requires significant aggregation over different attribute sets (dimensions) of the participating relations. Such queries may take hours or days, a time period unacceptable in most cases. As a result, it is important to study these queries and identify special frequent cases that can be evaluated with specialized algorithms. Understanding complex aggregate queries leads to better execution plans and, consequently, performance.

The idea of partitioning is fundamental and central in aggregate queries. This concept can be used to define a class of queries called group queries. The main characteristic of a group query is that it can be evaluated in a partitioned (or groupwise) fashion, i.e. the underlying relation(s) can be partitioned (based on a set of attributes) into disjoint groups and each group can be processed separately, possibly in parallel. For example, a query that performs a complex operation (e.g. joins and/or selections and/or aggregations) within each group is a group query. To express it in SQL, one has to join/correlate several views and/or subqueries on the grouping attributes. A naive plan (where the joins are executed) may be very expensive, even for relatively small base relations. On the other hand, a groupwise evaluation can lead to huge performance gains. We present a syntactic criterion to identify group queries in SQL and show that every group query can be expressed in a way that satisfies this criterion. This work is based on Chatziantoniou and Ross [Querying Multiple Features of Groups in Relational Databases. in: 22nd International Conference on Very Large Databases, VLDB, 1996, pp. 295–306].

The concept of group queries is useful not only in terms of evaluation, but also in terms of analyzing a complex decision support query that aggregates over different sets of attributes. In such a case the query may be decomposable to one or more query components, where each component is a group query. This observation allows parallel execution, multi-query processing and identification of special cases. We present in this paper two algorithms to decompose a complex aggregate query to its group query components.

The value of groupwise processing has been recently recognized by the research community and implemented in at least a major commercial system. To be of use however in a relational system, partitioned evaluation has to be modeled as a relational operator. We review three different approaches for such an operator and propose a generalized groupwise operator. We also perform some experiments to show that naive optimization with the new operator incorporated without taking into consideration decompositions to group query components does not always lead to the most efficient plans.

An extended syntax is another way to identify special frequent cases and apply efficient algorithms. Having specific operators for common operations contributes to the succinctness and optimizability of certain queries (e.g. datacubes). An extended syntax is presented with emphasis for multi-feature queries, a frequent and practical subclass of group queries that is amenable to specialized evaluation, involving (potentially repeated) selection, grouping and aggregation over the same groups.

Introduction

In a declarative query language the user specifies what to compute and not how to compute it. Declarativeness is an attractive feature since writing queries in a such language can be significantly easier for a non-programmer and thus less time-consuming. Usually there is an underlying algebra or a set of operators that declarative queries translate to. Optimization techniques for these languages are highly desirable since naive evaluations take too long.

This article deals with complex aggregate queries in relational databases. A complex aggregate query is a query with several query blocks (views or subqueries) correlated together with multiple joins. Most of these query blocks involve grouping and aggregation. In this work subqueries are not considered since very often these can be transformed to an equivalent expression using views [2], [3]. Complex aggregate queries are essential in decision support and data warehousing applications. Being able to execute efficiently and optimize properly these queries can lead to huge performance gains [1], [4], [5]. As a result, a significant number of papers exist in the literature dealing with optimization of aggregate queries [4], [6], [7], [8], [9], [10], [11].

Not long ago, when people were talking about aggregation they meant grouping of a relation R and applying an aggregate function (like average or min) on each group. That is not the case in today's applications such as decision support or data mining. Users are not any longer interested in just the average salary of each department. They want to use that average salary to get the employees of the departments that are paid over their departmental average. Then they want to group further these employees based on their marital status or sex. In other words users want to perform complex processing within each group and possibly grouping among different sets of attributes. Optimization of these queries is different from traditional techniques. Special models (multi-dimensional models [12], [13], [14], [15]), databases (multi-dimensional databases [16]) and operators [17], [18], [19], [20], [21] have been proposed to treat these individual characteristics of decision support queries efficiently. However, relational systems represent a big fraction of today's systems and SQL is the standard query language of these systems. Further, multi-dimensional systems are less flexible in that the dimensions of analysis typically need to be identified in advance.

The focus of this article is the concept of group queries, presented in [1]. The importance of this idea in query processing has been mentioned several times in bibliography (e.g. [22], [23]) and it has been implemented in a major commercial database system [5], [4]. Furthermore, it seems quite relevant in the context of data stream systems [24] since both group- and multi-feature queries (an important subclass of group queries, presented in [25] and reviewed in Appendix B) employ single-pass evaluation techniques to compute the answer.

The concept of group queries, presented in Section 2, is a very simple, intuitive approach to model aggregation because it is based on the idea of grouping (or partitioning). A query that conceptually expresses the idea “partition the relation based on one or more attributes and for each partition compute X” constitutes a group query. This X can be something simple (e.g. the average salary of the partition) or something complex, involving joins, selections and aggregations within the partition. Formally, a query Q is a group query if the base relation(s) can be partitioned on a set of attributes, another query Q can be applied on each partition, and the answer can be obtained by combining (unioning) the individual results. This is a simple definition but provides the basis for a powerful framework to analyze complex decision support queries that aggregate over different sets of attributes: based on the notion of group queries, we can decompose a complex aggregate query to its group query components. This is the gist of Section 3.

If the definition of a group query is relaxed to allow only some of the base relations to be partitioned on a set of attributes (and the rest participate in their entirety), then the notion of partial group queries results. This notion is useful when a complex aggregate query is decomposed into its partial group query components. Decompositions not previously detectable may be identified leading to more efficient plans. This class of queries is also discussed briefly in Section 3.

In Section 4 we discuss how groupwise evaluation can be incorporated into a standard relational framework. To do so, we need a relational operator to model partitioned processing. The benefits of this have been identified by the research community [5], [26], [4], [17] and implemented into a major commercial system [5], [4]. We review these approaches and propose a generalized operator to unify them. We also argue in Section 5 that naive applications of these operators do not always lead to the most efficient plans. Decompositions of a query to its group query components have to be considered.

Certain subclasses of group queries have some nice properties that allow for special syntax and evaluation methods. One such class consists of multi-feature queries, presented in [25]. It includes many important and practical decision support queries in various domains and applications (medical [27], telecommunications [28], data streams [29]) and it has been referenced by a major commercial database system [30] in the context of OLAP functions. Multi-feature queries and extended syntaxes in general are discussed in Appendix B.

Various example queries in this article will be over the following relation: FYI_LOG (id,day,month,year,time,section,duration)FYI_BILL(cust-id,month,year,payment)

A news organization called FYI maintains a web server with up-to-date news information. FYI_LOG relation stores the log of customer requests as they access different sections of the news. id attribute is unique to each customer, day, month, year and time represent the time of the connection, section denotes some section of the news (e.g., sports, world, business, travel, politics, etc.) and duration gives the time spent on this section. The relation FYI_BILL stores information about customers’ monthly payments.

FYI organization would like to ask a variety of queries on their database in order to understand user access patterns. For example, someone could ask: “For each customer, show how many times the world section was accessed and how many times the time spent in the world section was greater than the customer's average section time.” With such a query, someone could identify customers interested in the world section.

Section snippets

Group queries

Standard SQL specifications of many complex decision-support queries are syntactically verbose, requiring several intermediate views or subqueries. As a result, it is often difficult to optimize such queries. In this section we examine a particular class of complex queries, called group queries. Many complex decision support queries exhibit the following processing: for each value x in a dimension D (e.g. for each customer), evaluate a query Q. This query Q can be something simple (e.g.

Decomposition of complex aggregate queries

The group-queries concept can be used to optimize complex decision support queries by decomposing them to simpler to evaluate components.

Groupwise processing and relational systems

Groupwise processing is a common aspect of relational queries, not only in the context of decision support but also in XML publishing [5]. Being able to identify SQL queries as group queries or decompose them into group query components is an important step into applying special algorithms that leads to substantial performance benefits. To be of use however in a standard relational system, one has to model groupwise processing as a relational operator and study its interaction with other

Experimental results

The platform was a Pentium 2 GHz machine with 256 MB RAM running Microsoft SQL Server.

We tested query Q3 using two different evaluation methods, standard SQL and groupwise processing as described in [1]. In the latter case, the total cost is given by adding the cost to sort the relation on id to the cost to process each group individually. The results are shown in Fig. 13(a). The number of groups (customers) ranges from 2000 to 6000 and the size of each group is 400 tuples (four sections of 100

Related work

A large body of related work exists on relational query optimization. Galindo-Legaria and Joshi describe in [4] a number of independent primitives to generate a set of efficient execution strategies which have been implemented within Microsoft's SQL Server. One of the proposed primitives is called SegmentApply to formulate groupwise processing (which they call segmented execution) as an algebraic operator so it can be used in a cost-based optimizer. The relational MD-join operator [17] cleanly

Conclusions and future work

Processing efficiently and understanding complex decision support queries is crucial for intensive business data analysis. A general theoretical framework has been developed to analyze, understand, and optimize complex decision support queries. This framework is based on the notion of group queries, i.e. on the notion of partitioning. We believe that our framework is very suitable for processing complex queries that aggregate over different granularities and perform complex aggregations within

Acknowledgements

This research was supported by a grant from the AT& T Foundation, by a David and Lucile Packard Foundation Fellowship in Science and Engineering, by a Sloan Foundation Fellowship, by NSF CISE Grant CDA-9625374, and by an NSF Young Investigator award and was performed while the first author was affiliated with Columbia University.

References (51)

  • D. Chatziantoniou et al.

    Groupwise Processing of Relational Queries

  • U. Dayal

    Of nests and trees: a unified approach to processing queries that contain nested subqueries, aggregates, and quantifiers

  • P. Seshadri et al.

    Complex query decorrelation

  • C. Galindo-Legaria et al.

    Orthogonal optimization of subqueries and aggregation

  • S. Chaudhuri et al.

    On relational support for XML publishing: beyond sorting and tagging

  • S. Subramanian et al.

    Cost-based optimization of decision support queries using transient views

  • A. Levy et al.

    Reasoning with aggregation constraints

  • S. Chaudhuri et al.

    Optimizing queries with aggregate views

  • W.P. Yan et al.

    Eager aggregation and lazy aggregation

  • W.P. Yan et al.

    Performing group-by before join

  • S. Chaudhuri et al.

    Including group-by in query optimization

  • F. Gingras et al.

    nD-SQL: a multi-dimensional language for interoperability and OLAP

  • M. Gyssens et al.

    A foundation for multi-dimensional databases

  • R. Agrawal et al.

    Modeling Multidimensional Databases

  • C. Li, S.W. Wang, A data model for supporting on-line analytical processing, in: Proceedings of the International...
  • C. Arbor Software Corporation, Sunnyvale, Multidimensional Analysis: Converting Corporate Data into Strategic...
  • D. Chatziantoniou et al.

    The MD-Join: An Operator for Complex OLAP

  • H. Wang et al.

    User defined aggregates in object-relational systems

  • H. Wang et al.

    Database system extensions for decision support: the AXL approach

  • G. Graefe et al.

    On the efficient gathering of sufficient statistics for classification from large SQL databases

  • K. Ross et al.

    Complex aggregation at multiple granularities

  • R. Ramakrishnan et al.

    Database Management Systems

    (2002)
  • A. Silberschatz et al.

    Database System Concepts

    (2001)
  • B. Babcock et al.

    Models and issues in data stream systems

  • D. Chatziantoniou et al.

    Querying multiple features of groups in relational databases

  • Cited by (12)

    View all citing articles on Scopus
    View full text