Methods for parallel execution of complex database queries☆
Introduction
Database systems have exploited parallelism from the very beginning (in the early 1960s), in a way similar to operating systems. An operating system allows many users to use the same hardware concurrently, i.e., in parallel, and a database system allows the same database to be concurrently accessed by many application programs. The basic idea is the same: each time an application is blocked because of a slow operation such as disk I/O, communication with a remote node, or interaction with the user, the processor is re-dispatched to another application that is ready to execute. There is an important difference, though, between the way an operating system uses time sharing and what a multi-user database system does: whereas the operating system gives each user its own private resources (processes, address spaces, files, sessions, etc.), the database is a shared object. Different applications can access the same records or fields concurrently, and the problem of automatically maintaining correctness in such a parallel execution environment was one of the great challenges in database research in the 1970s. The problem was solved by the transaction concept with its sophisticated interplay of consistency checking, locking and recovery, and the resulting functionality proved an ideal basis for large distributed online information systems. The fact that all technical aspects related to locking and recovery can be completely hidden from the application makes modern SQL-style databases the first commercially used programming environments that automatically control parallelism – inter-query parallelism, that is.
For a long time, this was the only use of parallelism in databases – with the exception of a few special systems. However, in the 1980s users increasingly started developing applications that were radically different from classical online transactions: they ploughed through their databases (which, as was mentioned above, keep growing at a very high rate) in order to identify interesting patterns, trends, customer preferences, etc. – the kind of analysis that is called `decision support', `data mining', `online analytic processing' or many other names. Those applications use databases in a different way than online applications do. An online transaction is short; it typically executes some 100k instructions, touches less than 100 objects in the database, and generates less than 10 messages. However, there are many concurrent transactions of the same kind, each transaction should experience a short response time, and the system should run at high throughput rates. Parallelism in this case means scaleup: one wants to be able to do the same on larger databases, without affecting the response times of the online transactions.
Decision support applications, on the other hand, touch millions of objects, execute 10^12 instructions or more, and they typically do not execute in parallel with each other or with normal online transactions. Running such applications with reasonable elapsed times requires the use of parallelism in all parts of the database system, which has significant impact on the overall architecture of the database engine and its interfaces to the operating system.
Levels of parallelism in a database system
Parallelism can be employed in database systems at different levels, with granules of different sizes. The most relevant types of parallelism can be classified as follows:
- Inter-transaction parallelism,
- Parallelism among database operations inside a transaction,
- Parallel execution of a single database operation,
- Parallel access to the stored data.
This list illustrates that besides data parallelism at the lower layers, database systems also have a potential for function parallelism.
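To make the distinction concrete, the following sketch illustrates data parallelism at the lowest two levels: one and the same operator (a selection) is executed on every fragment of a partitioned relation by a separate worker. The partitioning, the data values and the predicate are invented for illustration; this is a minimal Python sketch, not the mechanism of any particular database system.

```python
from multiprocessing import Pool

# Illustrative data: a relation partitioned into three fragments.
PARTITIONS = [
    [("A", 10), ("B", 25)],
    [("C", 7), ("D", 40)],
    [("E", 31), ("F", 2)],
]

def scan_partition(rows):
    """Intra-operation parallelism: each worker scans one fragment,
    applying the same selection predicate (value > 20)."""
    return [r for r in rows if r[1] > 20]

if __name__ == "__main__":
    with Pool(len(PARTITIONS)) as pool:
        # The same operator runs on every fragment in parallel;
        # the partial results are merged afterwards.
        result = [r for part in pool.map(scan_partition, PARTITIONS)
                  for r in part]
    print(sorted(result))  # [('B', 25), ('D', 40), ('E', 31)]
```

Function parallelism, by contrast, would assign *different* operators of one query (e.g., scan, join, aggregation) to different processors, pipelining tuples between them.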
Parallel algorithms for special database operators
Let us now sketch the methods for executing two important operations in parallel. One is the join, and the other one is the creation of an index on a relation for a group of attributes. We assume that all relations involved are partitioned into n files and that there is one processor per file. It is assumed that the reader is at least somewhat familiar with the two operations under consideration.
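The snippet does not spell the algorithms out, so the following is a minimal sketch of one standard approach to the first operation, a partitioned (hash-based) parallel join, under the stated assumption of n partitions with one worker each: both relations are hash-partitioned on the join attribute so that matching tuples land in the same partition, and each worker then runs an ordinary build/probe hash join on its pair of fragments. All names and data values are illustrative.

```python
from collections import defaultdict
from multiprocessing import Pool

N = 3  # number of partitions / workers

def partition(relation, key_index):
    """Hash-partition a relation on the join attribute so that matching
    tuples of R and S end up in the partition with the same number."""
    parts = [[] for _ in range(N)]
    for row in relation:
        parts[hash(row[key_index]) % N].append(row)
    return parts

def local_join(args):
    """Each worker joins one pair of co-partitioned fragments with a
    classical hash join: build a hash table on R, probe it with S."""
    r_part, s_part = args
    table = defaultdict(list)
    for r in r_part:                 # build phase on R's fragment
        table[r[0]].append(r)
    out = []
    for s in s_part:                 # probe phase with S's fragment
        for r in table[s[0]]:
            out.append(r + s[1:])
    return out

if __name__ == "__main__":
    R = [(1, "x"), (2, "y"), (3, "z")]           # (joinkey, payload)
    S = [(1, "a"), (3, "b"), (3, "c"), (4, "d")]
    pairs = list(zip(partition(R, 0), partition(S, 0)))
    with Pool(N) as pool:
        result = [t for part in pool.map(local_join, pairs) for t in part]
    print(sorted(result))  # [(1, 'x', 'a'), (3, 'z', 'b'), (3, 'z', 'c')]
```

The key property is that after the partitioning phase the n local joins are completely independent, so they scale with the number of processors; only the final merge of results is sequential.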
Parallel database architectures
Historically, the database community distinguished three types of system architecture as platforms for parallel database systems:
Shared nothing: Each processor has its own memory and its own peripherals; communication is through messages only. Examples are: Tandem, NCR, workstation clusters, IBM SP2. This design has great fault isolation and scales well. The problem is the communication bottleneck.
Shared disk: Each processor has its own memory, but it shares the disk subsystem with the other nodes, so no static partitioning of the data is needed; the price is that buffer coherence must be maintained across nodes.
Shared memory: All processors share a common memory and all peripherals. This design is the easiest to program, but the shared memory bus limits its scalability.
A typical DB-operation for decision support
To illustrate the way parallelism is employed for typical, complex decision support operations, let us consider the so-called star join [14].
The best way to introduce the problem is a simple example. Assume a chain-store application, for which we have the following relations:
orders (orderID, custID, artID, storeID, quantity, date)
customer (custID, name, type, industry, size, region)
article (artID, name, price, weight, category)
store (storeID, name, city, region, size)
The `orders' table is called the fact table of the star schema; `customer', `article' and `store' are its dimension tables.
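The evaluation of the star join is truncated in the snippet; the following sketch shows the usual strategy on the example schema: apply the selections to the small dimension tables first, then scan the large fact table once, probing the surviving key sets. That scan is exactly the step that parallelizes well over partitions of `orders'. All data values and predicates are invented.

```python
# Dimension tables (small), keyed on their IDs; values are invented.
customer = {1: {"name": "ACME", "region": "south"},
            2: {"name": "Brown", "region": "north"}}
article  = {10: {"name": "bolt", "category": "hardware"},
            11: {"name": "glue", "category": "chemicals"}}

# Fact table (large): orders(orderID, custID, artID, storeID, quantity, date)
orders = [(100, 1, 10, 7, 5, "1999-01-04"),
          (101, 2, 10, 7, 2, "1999-01-05"),
          (102, 1, 11, 8, 9, "1999-01-06")]

# Step 1: restrict the dimension tables first (cheap, they are small).
south_customers = {k for k, v in customer.items() if v["region"] == "south"}
hardware_items  = {k for k, v in article.items() if v["category"] == "hardware"}

# Step 2: scan the fact table once, probing the qualifying key sets.
# In a parallel system this scan runs independently on each partition
# of the orders table, with the small key sets replicated to all workers.
result = [o for o in orders
          if o[1] in south_customers and o[2] in hardware_items]
print(result)  # [(100, 1, 10, 7, 5, '1999-01-04')]
```

Restricting the dimensions before touching the fact table keeps the expensive scan down to a single pass, instead of materializing a sequence of pairwise joins.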
Summary
Decision support applications push the performance limits of today’s database systems. There are many queries users would like to run routinely that take a couple of hundred hours to complete. If somebody manages to reduce the elapsed time for these problems to a few minutes, i.e., gain two orders of magnitude, he will have a big advantage over his competitors. Parallelism is the obvious way towards this goal. Databases lend themselves naturally to massive data parallelism. In principle,
For further reading
[1], [5], [6], [9], [11], [12], [13], [17], [18].
References
- W. Becker, Dynamische Lastbalancierung im HiCon-System, Stuttgart University, IPVR, Technischer Bericht, 1994 (in German; title: Dynamic Load Balancing in the HiCon System).
- P.A. Bernstein et al., Concurrency Control and Recovery in Database Systems (1987).
- Parallel Programming – An Introduction (1993).
- D.J. DeWitt, R.H. Gerber, Multiprocessor hash-based join algorithms, in: Proceedings of the 11th International...
- D.J. DeWitt, et al., A single user evaluation of the Gamma Database Machine, University of Wisconsin-Madison, Computer...
- M. Sherman, Architecture of the Encina distributed transaction processing family, in: Proceedings of the ACM SIGMOD...
- J. Gray, B. Horst, M. Walker, Parity striping of disc arrays: low-cost reliable storage with acceptable throughput, in:...
- J. Gray et al., Transaction Processing: Concepts and Techniques (1992).
- D.J. DeWitt et al., Parallel database systems: the future of high performance database systems, Communications of the ACM (1992).
☆ This overview uses material from a survey paper by the same author that was published under the title `Parallel database techniques in decision support and data mining' in the Proceedings of the 1997 ParCo Conference.