Methods for parallel execution of complex database queries☆
Introduction
Database systems have exploited parallelism from the very beginning (in the early 1960s), in a way similar to operating systems. An operating system allows many users to use the same hardware concurrently, i.e., in parallel, and a database system allows the same database to be concurrently accessed by many application programs. The basic idea is the same: each time an application is blocked because of a slow operation such as disk I/O, communication with a remote node, or interaction with the user, the processor is re-dispatched to another application that is ready to execute. There is an important difference, though, between the way an operating system uses time sharing and what a multi-user database system does: whereas the operating system gives each user its own private resources (processes, address spaces, files, sessions, etc.), the database is a shared object. Different applications can access the same records or fields concurrently, and the problem of automatically maintaining correctness in such a parallel execution environment was one of the great challenges in database research in the 1970s. The problem was solved by the transaction concept with its sophisticated interplay of consistency checking, locking and recovery, and the resulting functionality proved an ideal basis for large distributed online information systems. The fact that all technical aspects related to locking and recovery can be completely hidden from the application makes modern SQL-style databases the first commercially used programming environments that automatically control parallelism – inter-query parallelism, that is.
For a long time, this was the only use of parallelism in databases – with the exception of a few special systems. However, in the 1980s users increasingly started developing applications that were radically different from classical online transactions: they ploughed through their databases (which, as was mentioned above, keep growing at a very high rate) in order to identify interesting patterns, trends, customer preferences, etc. – the kind of analysis that is called `decision support', `data mining', `online analytic processing' or many other names. Those applications use databases in a different way than online applications do. An online transaction is short; it typically executes some 100k instructions, touches less than 100 objects in the database, and generates less than 10 messages. However, there are many concurrent transactions of the same kind, each transaction should experience a short response time, and the system should run at high throughput rates. Parallelism in this case means scaleup: one wants to be able to do the same on larger databases, without affecting the response times of the online transactions.
Decision support applications, on the other hand, touch millions of objects, execute 10^12 instructions or more, and they typically do not execute in parallel with each other or with normal online transactions. Running such applications with reasonable elapsed times requires the use of parallelism in all parts of the database system, which has significant impact on the overall architecture of the database engine and its interfaces to the operating system.
Levels of parallelism in a database system
Parallelism can be employed in database systems at different levels, with granules of different sizes. The most relevant types of parallelism can be classified as follows:
- Inter-transaction parallelism,
- Parallelism among database operations inside a transaction,
- Parallel execution of a single database operation,
- Parallel access to the stored data.
This list illustrates that besides data parallelism at the lower layers, database systems also have a potential for function parallelism.
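To make the distinction concrete, the following sketch illustrates data parallelism at the lowest two levels: one and the same operator (a selection) is executed on every fragment of a partitioned relation by a separate worker. The partitioning, the data values and the predicate are invented for illustration; this is a minimal Python sketch, not the mechanism of any particular database system.

```python
from multiprocessing import Pool

# Illustrative data: a relation partitioned into three fragments.
PARTITIONS = [
    [("A", 10), ("B", 25)],
    [("C", 7), ("D", 40)],
    [("E", 31), ("F", 2)],
]

def scan_partition(rows):
    """Intra-operation parallelism: each worker scans one fragment,
    applying the same selection predicate (value > 20)."""
    return [r for r in rows if r[1] > 20]

if __name__ == "__main__":
    with Pool(len(PARTITIONS)) as pool:
        # The same operator runs on every fragment in parallel;
        # the partial results are merged afterwards.
        result = [r for part in pool.map(scan_partition, PARTITIONS)
                  for r in part]
    print(sorted(result))  # [('B', 25), ('D', 40), ('E', 31)]
```

Function parallelism, by contrast, would assign *different* operators of one query (e.g., scan, join, aggregation) to different processors, pipelining tuples between them.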
Parallel algorithms for special database operators
Let us now sketch the methods for executing two important operations in parallel. One is the join, and the other one is the creation of an index on a relation for a group of attributes. We assume that all relations involved are partitioned into n files and that there is one processor per file. It is assumed that the reader is at least somewhat familiar with the two operations under consideration.
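The snippet does not spell the algorithms out, so the following is a minimal sketch of one standard approach to the first operation, a partitioned (hash-based) parallel join, under the stated assumption of n partitions with one worker each: both relations are hash-partitioned on the join attribute so that matching tuples land in the same partition, and each worker then runs an ordinary build/probe hash join on its pair of fragments. All names and data values are illustrative.

```python
from collections import defaultdict
from multiprocessing import Pool

N = 3  # number of partitions / workers

def partition(relation, key_index):
    """Hash-partition a relation on the join attribute so that matching
    tuples of R and S end up in the partition with the same number."""
    parts = [[] for _ in range(N)]
    for row in relation:
        parts[hash(row[key_index]) % N].append(row)
    return parts

def local_join(args):
    """Each worker joins one pair of co-partitioned fragments with a
    classical hash join: build a hash table on R, probe it with S."""
    r_part, s_part = args
    table = defaultdict(list)
    for r in r_part:                 # build phase on R's fragment
        table[r[0]].append(r)
    out = []
    for s in s_part:                 # probe phase with S's fragment
        for r in table[s[0]]:
            out.append(r + s[1:])
    return out

if __name__ == "__main__":
    R = [(1, "x"), (2, "y"), (3, "z")]           # (joinkey, payload)
    S = [(1, "a"), (3, "b"), (3, "c"), (4, "d")]
    pairs = list(zip(partition(R, 0), partition(S, 0)))
    with Pool(N) as pool:
        result = [t for part in pool.map(local_join, pairs) for t in part]
    print(sorted(result))  # [(1, 'x', 'a'), (3, 'z', 'b'), (3, 'z', 'c')]
```

The key property is that after the partitioning phase the n local joins are completely independent, so they scale with the number of processors; only the final merge of results is sequential.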
Parallel database architectures
Historically, the database community distinguished three types of system architecture as platforms for parallel database systems:
Shared nothing: Each processor has its own memory and its own peripherals; communication is through messages only. Examples are: Tandem, NCR, workstation clusters, IBM SP2. This design has great fault isolation and scales well. The problem is the communication bottleneck.
Shared disk: Each processor has its own memory, but it shares the disk subsystem with the other nodes, so no static partitioning of the data is needed; the price is that buffer coherence must be maintained across nodes.
Shared memory: All processors share a common memory and all peripherals. This design is the easiest to program, but the shared memory bus limits its scalability.
A typical DB-operation for decision support
To illustrate the way parallelism is employed for typical, complex decision support operations, let us consider the so-called star join [14].
The best way to introduce the problem is a simple example. Assume a chain-store application, for which we have the following relations:
orders (orderID, custID, artID, storeID, quantity, date)
customer (custID, name, type, industry, size, region)
article (artID, name, price, weight, category)
store (storeID, name, city, region, size)
The `orders' table is called the fact table of the star schema; `customer', `article' and `store' are its dimension tables.
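The evaluation of the star join is truncated in the snippet; the following sketch shows the usual strategy on the example schema: apply the selections to the small dimension tables first, then scan the large fact table once, probing the surviving key sets. That scan is exactly the step that parallelizes well over partitions of `orders'. All data values and predicates are invented.

```python
# Dimension tables (small), keyed on their IDs; values are invented.
customer = {1: {"name": "ACME", "region": "south"},
            2: {"name": "Brown", "region": "north"}}
article  = {10: {"name": "bolt", "category": "hardware"},
            11: {"name": "glue", "category": "chemicals"}}

# Fact table (large): orders(orderID, custID, artID, storeID, quantity, date)
orders = [(100, 1, 10, 7, 5, "1999-01-04"),
          (101, 2, 10, 7, 2, "1999-01-05"),
          (102, 1, 11, 8, 9, "1999-01-06")]

# Step 1: restrict the dimension tables first (cheap, they are small).
south_customers = {k for k, v in customer.items() if v["region"] == "south"}
hardware_items  = {k for k, v in article.items() if v["category"] == "hardware"}

# Step 2: scan the fact table once, probing the qualifying key sets.
# In a parallel system this scan runs independently on each partition
# of the orders table, with the small key sets replicated to all workers.
result = [o for o in orders
          if o[1] in south_customers and o[2] in hardware_items]
print(result)  # [(100, 1, 10, 7, 5, '1999-01-04')]
```

Restricting the dimensions before touching the fact table keeps the expensive scan down to a single pass, instead of materializing a sequence of pairwise joins.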
Summary
Decision support applications push the performance limits of today’s database systems. There are many queries users would like to run routinely that take a couple of hundred hours to complete. If somebody manages to reduce the elapsed time for these problems to a few minutes, i.e., gain two orders of magnitude, he will have a big advantage over his competitors. Parallelism is the obvious way towards this goal. Databases lend themselves naturally to massive data parallelism. In principle,
For further reading
[1], [5], [6], [9], [11], [12], [13], [17], [18].
References
- W. Becker, Dynamische Lastbalancierung im HiCon-System, Stuttgart University, IPVR, Technischer Bericht, 1994 (in German; title: Dynamic Load Balancing in the HiCon System).
- P.A. Bernstein et al., Concurrency Control and Recovery in Database Systems (1987).
- Parallel Programming – An Introduction (1993).
- D.J. DeWitt, R.H. Gerber, Multiprocessor hash-based join algorithms, in: Proceedings of the 11th International...
- D.J. DeWitt, et al., A single user evaluation of the Gamma Database Machine, University of Wisconsin-Madison, Computer...
- M. Sherman, Architecture of the Encina distributed transaction processing family, in: Proceedings of the ACM SIGMOD...
- J. Gray, B. Horst, M. Walker, Parity striping of disc arrays: low-cost reliable storage with acceptable throughput, in:...
- J. Gray et al., Transaction Processing: Concepts and Techniques (1992).
- D.J. DeWitt et al., Parallel database systems: the future of high performance database systems, Communications of the ACM (1992).
☆ This overview uses material from a survey paper by the same author that was published under the title `Parallel database techniques in decision support and data mining' in the Proceedings of the 1997 ParCo Conference.