Partitioned optimization of complex queries
Introduction
In a declarative query language the user specifies what to compute and not how to compute it. Declarativeness is an attractive feature since writing queries in a such language can be significantly easier for a non-programmer and thus less time-consuming. Usually there is an underlying algebra or a set of operators that declarative queries translate to. Optimization techniques for these languages are highly desirable since naive evaluations take too long.
This article deals with complex aggregate queries in relational databases. A complex aggregate query is a query with several query blocks (views or subqueries) correlated together with multiple joins. Most of these query blocks involve grouping and aggregation. In this work subqueries are not considered since very often these can be transformed to an equivalent expression using views [2], [3]. Complex aggregate queries are essential in decision support and data warehousing applications. Being able to execute efficiently and optimize properly these queries can lead to huge performance gains [1], [4], [5]. As a result, a significant number of papers exist in the literature dealing with optimization of aggregate queries [4], [6], [7], [8], [9], [10], [11].
Not long ago, when people were talking about aggregation they meant grouping of a relation R and applying an aggregate function (like average or min) on each group. That is not the case in today's applications such as decision support or data mining. Users are not any longer interested in just the average salary of each department. They want to use that average salary to get the employees of the departments that are paid over their departmental average. Then they want to group further these employees based on their marital status or sex. In other words users want to perform complex processing within each group and possibly grouping among different sets of attributes. Optimization of these queries is different from traditional techniques. Special models (multi-dimensional models [12], [13], [14], [15]), databases (multi-dimensional databases [16]) and operators [17], [18], [19], [20], [21] have been proposed to treat these individual characteristics of decision support queries efficiently. However, relational systems represent a big fraction of today's systems and SQL is the standard query language of these systems. Further, multi-dimensional systems are less flexible in that the dimensions of analysis typically need to be identified in advance.
The focus of this article is the concept of group queries, presented in [1]. The importance of this idea in query processing has been mentioned several times in bibliography (e.g. [22], [23]) and it has been implemented in a major commercial database system [5], [4]. Furthermore, it seems quite relevant in the context of data stream systems [24] since both group- and multi-feature queries (an important subclass of group queries, presented in [25] and reviewed in Appendix B) employ single-pass evaluation techniques to compute the answer.
The concept of group queries, presented in Section 2, is a very simple, intuitive approach to model aggregation because it is based on the idea of grouping (or partitioning). A query that conceptually expresses the idea “partition the relation based on one or more attributes and for each partition compute X” constitutes a group query. This X can be something simple (e.g. the average salary of the partition) or something complex, involving joins, selections and aggregations within the partition. Formally, a query Q is a group query if the base relation(s) can be partitioned on a set of attributes, another query can be applied on each partition, and the answer can be obtained by combining (unioning) the individual results. This is a simple definition but provides the basis for a powerful framework to analyze complex decision support queries that aggregate over different sets of attributes: based on the notion of group queries, we can decompose a complex aggregate query to its group query components. This is the gist of Section 3.
If the definition of a group query is relaxed to allow only some of the base relations to be partitioned on a set of attributes (and the rest participate in their entirety), then the notion of partial group queries results. This notion is useful when a complex aggregate query is decomposed into its partial group query components. Decompositions not previously detectable may be identified leading to more efficient plans. This class of queries is also discussed briefly in Section 3.
In Section 4 we discuss how groupwise evaluation can be incorporated into a standard relational framework. To do so, we need a relational operator to model partitioned processing. The benefits of this have been identified by the research community [5], [26], [4], [17] and implemented into a major commercial system [5], [4]. We review these approaches and propose a generalized operator to unify them. We also argue in Section 5 that naive applications of these operators do not always lead to the most efficient plans. Decompositions of a query to its group query components have to be considered.
Certain subclasses of group queries have some nice properties that allow for special syntax and evaluation methods. One such class consists of multi-feature queries, presented in [25]. It includes many important and practical decision support queries in various domains and applications (medical [27], telecommunications [28], data streams [29]) and it has been referenced by a major commercial database system [30] in the context of OLAP functions. Multi-feature queries and extended syntaxes in general are discussed in Appendix B.
Various example queries in this article will be over the following relation:
A news organization called FYI maintains a web server with up-to-date news information. FYI_LOG relation stores the log of customer requests as they access different sections of the news. id attribute is unique to each customer, day, month, year and time represent the time of the connection, section denotes some section of the news (e.g., sports, world, business, travel, politics, etc.) and duration gives the time spent on this section. The relation FYI_BILL stores information about customers’ monthly payments.
FYI organization would like to ask a variety of queries on their database in order to understand user access patterns. For example, someone could ask: “For each customer, show how many times the world section was accessed and how many times the time spent in the world section was greater than the customer's average section time.” With such a query, someone could identify customers interested in the world section.
Section snippets
Group queries
Standard SQL specifications of many complex decision-support queries are syntactically verbose, requiring several intermediate views or subqueries. As a result, it is often difficult to optimize such queries. In this section we examine a particular class of complex queries, called group queries. Many complex decision support queries exhibit the following processing: for each value x in a dimension D (e.g. for each customer), evaluate a query . This query can be something simple (e.g.
Decomposition of complex aggregate queries
The group-queries concept can be used to optimize complex decision support queries by decomposing them to simpler to evaluate components.
Groupwise processing and relational systems
Groupwise processing is a common aspect of relational queries, not only in the context of decision support but also in XML publishing [5]. Being able to identify SQL queries as group queries or decompose them into group query components is an important step into applying special algorithms that leads to substantial performance benefits. To be of use however in a standard relational system, one has to model groupwise processing as a relational operator and study its interaction with other
Experimental results
The platform was a Pentium 2 GHz machine with 256 MB RAM running Microsoft SQL Server.
We tested query Q3 using two different evaluation methods, standard SQL and groupwise processing as described in [1]. In the latter case, the total cost is given by adding the cost to sort the relation on id to the cost to process each group individually. The results are shown in Fig. 13(a). The number of groups (customers) ranges from 2000 to 6000 and the size of each group is 400 tuples (four sections of 100
Related work
A large body of related work exists on relational query optimization. Galindo-Legaria and Joshi describe in [4] a number of independent primitives to generate a set of efficient execution strategies which have been implemented within Microsoft's SQL Server. One of the proposed primitives is called SegmentApply to formulate groupwise processing (which they call segmented execution) as an algebraic operator so it can be used in a cost-based optimizer. The relational MD-join operator [17] cleanly
Conclusions and future work
Processing efficiently and understanding complex decision support queries is crucial for intensive business data analysis. A general theoretical framework has been developed to analyze, understand, and optimize complex decision support queries. This framework is based on the notion of group queries, i.e. on the notion of partitioning. We believe that our framework is very suitable for processing complex queries that aggregate over different granularities and perform complex aggregations within
Acknowledgements
This research was supported by a grant from the AT& T Foundation, by a David and Lucile Packard Foundation Fellowship in Science and Engineering, by a Sloan Foundation Fellowship, by NSF CISE Grant CDA-9625374, and by an NSF Young Investigator award and was performed while the first author was affiliated with Columbia University.
References (51)
- et al.
Groupwise Processing of Relational Queries
Of nests and trees: a unified approach to processing queries that contain nested subqueries, aggregates, and quantifiers
- et al.
Complex query decorrelation
- et al.
Orthogonal optimization of subqueries and aggregation
- et al.
On relational support for XML publishing: beyond sorting and tagging
- et al.
Cost-based optimization of decision support queries using transient views
- et al.
Reasoning with aggregation constraints
- et al.
Optimizing queries with aggregate views
- et al.
Eager aggregation and lazy aggregation
- et al.
Performing group-by before join
Including group-by in query optimization
nD-SQL: a multi-dimensional language for interoperability and OLAP
A foundation for multi-dimensional databases
Modeling Multidimensional Databases
The MD-Join: An Operator for Complex OLAP
User defined aggregates in object-relational systems
Database system extensions for decision support: the AXL approach
On the efficient gathering of sufficient statistics for classification from large SQL databases
Complex aggregation at multiple granularities
Database Management Systems
Database System Concepts
Models and issues in data stream systems
Querying multiple features of groups in relational databases
Cited by (12)
Semantic optimization of query transformation in a large-scale peer-to-peer network
2012, NeurocomputingCitation Excerpt :Traditionally, many studies [17] have been done on query expansion and rewriting for distributed databases. In [18], complex query for RDBMS has been investigated on how to conduct query modulation. Particularly, in semantic web community, semantics and knowledge can support the query processes [10–12,19].
Optimizing Big-Data Queries Using Program Synthesis
2017, SOSP 2017 - Proceedings of the 26th ACM Symposium on Operating Systems PrinciplesEnergy analysis of hardware and software range partitioning
2014, ACM Transactions on Computer SystemsHardware partitioning for big data analytics
2014, IEEE MicroNavigating big data with high-throughput, energy-efficient data partitioning
2013, Proceedings - International Symposium on Computer Architecture