A cost-optimal parallel implementation of a tridiagonal system solver using skeletons☆
Introduction
The design of parallel algorithms and their implementation on parallel machines is a complex and error-prone process. Traditionally, application programmers take a sequential algorithm and use their experience to find a parallel implementation in an ad hoc manner. A more systematic approach is to use well-defined, reusable parallel components, called skeletons [3]. A skeleton can be viewed as a higher-order function, customizable for a particular application by means of parameters provided by the application programmer. The programmer expresses an application using skeletons as high-level language constructs, whose efficient implementations for particular parallel machines are provided by a compiler or library. The expected performance of a skeleton-based program can be estimated early on, thus facilitating a performance-directed design process.
This paper addresses a practically relevant case study — solving a tridiagonal system (TdS) of linear equations. TdS solvers are notoriously difficult to parallelize: their sparse structure provides relatively little potential parallelism, while communication demand is relatively high (see [9] for an overview and Section 6 for more details).
Unlike previous ad hoc approaches, the paper’s main contribution is the systematic transformation of an intuitive sequential formulation of TdS into a skeleton-based form. As a result, we obtain an efficient, cost-optimal parallel implementation using the Message Passing Interface (MPI).
We start by introducing basic parallel skeletons (Section 2) and then express our case study – the tridiagonal system solver – using these skeletons (Section 3). We make a design decision between two possible parallel solutions using analytical performance estimates and arrive at a cost-optimal implementation (Section 4). We experimentally study the performance of the developed MPI implementations on a Cray T3E machine (Section 5), and conclude by discussing our results in the context of related work.
Basic data-parallel skeletons
In this section, we introduce basic data-parallel skeletons as higher-order functions defined on non-empty lists; function application is denoted by juxtaposition (f x means f(x)):
- Map: Applying a unary function to all elements of a list.
- Zip: Element-wise application of a binary operator to lists of equal length.
- Scan-left and scan-right: Computing prefix sums of a list by traversing the list from left to right (or vice versa).
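The formal definitions of these skeletons are not reproduced in this excerpt. As a rough sequential sketch of their meaning, they can be modelled on Python lists as follows. The function names (`map_skel`, `zip_skel`, `scanl_skel`, `scanr_skel`) are hypothetical, and the scan conventions chosen here (no seed element, the accumulated value as the left argument in scan-left and as the right argument in scan-right) are assumptions rather than the paper's exact formulation.

```python
from typing import Callable, List, TypeVar

A = TypeVar("A")
B = TypeVar("B")
C = TypeVar("C")


def map_skel(f: Callable[[A], B], xs: List[A]) -> List[B]:
    """Map: apply the unary function f to every element of xs."""
    return [f(x) for x in xs]


def zip_skel(op: Callable[[A, B], C], xs: List[A], ys: List[B]) -> List[C]:
    """Zip: combine xs and ys element-wise with the binary operator op."""
    if len(xs) != len(ys):
        raise ValueError("zip_skel expects lists of equal length")
    return [op(x, y) for x, y in zip(xs, ys)]


def scanl_skel(op: Callable[[A, A], A], xs: List[A]) -> List[A]:
    """Scan-left: prefix results, traversing xs from left to right."""
    if not xs:
        raise ValueError("skeletons are defined on non-empty lists")
    out = [xs[0]]
    for x in xs[1:]:
        out.append(op(out[-1], x))  # accumulated value is the left argument
    return out


def scanr_skel(op: Callable[[A, A], A], xs: List[A]) -> List[A]:
    """Scan-right: prefix results, traversing xs from right to left."""
    if not xs:
        raise ValueError("skeletons are defined on non-empty lists")
    out = [xs[-1]]
    for x in reversed(xs[:-1]):
        out.append(op(x, out[-1]))  # accumulated value is the right argument
    out.reverse()
    return out
```

For example, with addition as the operator, `scanl_skel` maps `[1, 2, 3, 4]` to `[1, 3, 6, 10]` and `scanr_skel` maps it to `[10, 9, 7, 4]`. In a parallel setting each of these functions would be backed by a library implementation; the sketch only fixes the input/output behaviour.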
Tridiagonal system solver using basic skeletons
We consider the solution of a tridiagonal system (TdS) of linear equations, A x = b, where A is an n × n matrix representing coefficients, x a vector of unknowns and b the right-hand-side vector. The only nonzero values of matrix A are on the main diagonal and directly above and below it (we call these the upper and lower diagonal, respectively).
A typical sequential algorithm for TdS is Gaussian elimination [8], [10] which eliminates the lower and upper diagonal of the matrix (Fig. 1). Both the
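Fig. 1 is not reproduced in this excerpt. As a minimal sketch of the two-sweep elimination described above, assuming the matrix is stored as three lists (one per diagonal) and that no pivoting is needed (for instance, a diagonally dominant matrix), the sequential algorithm could look as follows. The function name `solve_tridiagonal` and the storage layout are hypothetical choices made for illustration.

```python
from typing import List


def solve_tridiagonal(lower: List[float], diag: List[float],
                      upper: List[float], rhs: List[float]) -> List[float]:
    """Solve a tridiagonal system by Gaussian elimination without pivoting.

    Row i of the system reads
        lower[i]*x[i-1] + diag[i]*x[i] + upper[i]*x[i+1] = rhs[i],
    where lower[0] and upper[n-1] are ignored.
    Assumption: all pivots stay non-zero.
    """
    n = len(diag)
    if not (len(lower) == len(upper) == len(rhs) == n) or n == 0:
        raise ValueError("all four lists must have the same non-zero length")

    d = list(diag)  # work on copies so the inputs stay unchanged
    r = list(rhs)

    # Sweep 1 (top to bottom): eliminate the lower diagonal.
    for i in range(1, n):
        m = lower[i] / d[i - 1]
        d[i] -= m * upper[i - 1]
        r[i] -= m * r[i - 1]

    # Sweep 2 (bottom to top): eliminate the upper diagonal.
    # Row i+1 now holds only its diagonal entry, so d[i] is unaffected.
    for i in range(n - 2, -1, -1):
        m = upper[i] / d[i + 1]
        r[i] -= m * r[i + 1]

    # Only the main diagonal is left: divide to obtain the unknowns.
    return [r[i] / d[i] for i in range(n)]
```

The first sweep carries values from left to right and the second from right to left, which is the dependence pattern of a scan-left followed by a scan-right.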
Towards a cost-optimal implementation
Our approach to parallelizing the function is to express it in terms of skeletons that provide more potential for parallelism than the basic skeletons used so far. Our first candidate is the distributable homomorphism (DH) skeleton, first introduced in [4]:
Definition 1. The DH skeleton is a higher-order function with two parameter operators, defined as follows for arbitrary lists of equal length:
The DH skeleton is a special form of the
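The defining equations of the DH skeleton are elided in this excerpt. The sketch below follows the form in which DH is commonly stated in the skeleton literature, which is an assumption here: a singleton list is returned unchanged, and for a list split into two halves of equal length, the results of the recursive calls are combined element-wise with the first operator to form the left half of the output and with the second operator to form the right half. The names `dh`, `oplus` and `otimes` are hypothetical, and the sketch is sequential.

```python
from typing import Callable, List, TypeVar

A = TypeVar("A")


def dh(oplus: Callable[[A, A], A], otimes: Callable[[A, A], A],
       xs: List[A]) -> List[A]:
    """Sequential sketch of a distributable homomorphism (DH).

    Assumed definition (not taken from the excerpt):
        dh [a]      = [a]
        dh (x ++ y) = zip(oplus)(dh x, dh y) ++ zip(otimes)(dh x, dh y)
    for lists x and y of equal length, so len(xs) must be a power of two.
    """
    n = len(xs)
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("dh expects a list whose length is a power of two")
    if n == 1:
        return list(xs)

    half = n // 2
    left = dh(oplus, otimes, xs[:half])
    right = dh(oplus, otimes, xs[half:])

    # Both output halves are built from the same pair of recursive results.
    combined_plus = [oplus(u, v) for u, v in zip(left, right)]
    combined_times = [otimes(u, v) for u, v in zip(left, right)]
    return combined_plus + combined_times
```

With addition for both operators, `dh` turns `[1, 2, 3, 4]` into `[10, 10, 10, 10]`, that is, every position receives the total sum. The two recursive calls are independent of each other, which is where the potential for parallelism comes from.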
Experimental results
In this section, we briefly report experimental performance results for the parallel version of the tridiagonal system solver developed in this paper. The measurements were carried out on a Cray T3E machine with 24 processors of type Alpha 21164 (300 MHz, 128 MB), using the native MPI implementation.
The two plots in Fig. 2 (left) compare the runtimes of the optimal sequential algorithm with our cost-optimal parallel version depending on the problem size. The cost-optimal solution demonstrates an
Related work and conclusions
The main contribution of this paper is the systematic design of a cost-optimal parallel implementation for a tridiagonal system solver. The important feature of our design is that it is based on well-defined parallel components (skeletons), which can be re-used for different applications. The design process began with an intuitively correct sequential version of the algorithm and proceeded by choosing a suitable parallel skeleton taking into account the expected performance. Furthermore, we
Acknowledgements
We are grateful to Emanuel Kitzelmann who helped a lot in proving Theorem 1, to Martin Alt for many fruitful discussions, to anonymous referees for helpful comments and suggestions, and to Julia Kaiser-Mariani who assisted in improving the presentation.
References (13)
- H. Bischof, S. Gorlatch, E. Kitzelmann, The double-scan skeleton and its parallelization, Technical Report 2002/06,...
- et al., Cost optimality and predictability of parallel programming with skeletons
- M.I. Cole, Algorithmic Skeletons: A Structured Approach to the Management of Parallel Computation, Ph.D. thesis,...
- Systematic efficient parallelization of scan and other list homomorphisms
- et al., A generic MPI implementation for a data-parallel skeleton: formal derivation and application to FFT, Parallel Process. Lett. (1998)
- A fast direct solution of Poisson’s equation using Fourier analysis, JACM (1965)
Holger Bischof graduated with a diploma in computer science from the University of Passau, Germany in 1999. Subsequently, he worked as a Research Associate at the RWTH Aachen (1999–2000), and at the Technical University of Berlin (2000–2003). Since November 2003, Holger Bischof has been at the University of Münster, Germany, where he is completing his doctoral thesis. His current research areas include the development of parallel programs using skeletons, formal methods for parallel and distributed systems, and performance evaluation.
Sergei Gorlatch received his Master’s degree in Computer Science from Kiev State University in 1979, and his PhD degree from Glushkov Institute of Cybernetics, Kiev, Ukraine in 1984. From 1991 to 1992, he was a Humboldt Research Fellow at the Technical University of Munich. From 1992 to 1999, Dr. Gorlatch worked as Assistant Professor at the University of Passau, Germany, where he obtained his “Habilitation” (post-doctoral degree) in 1998. From 2000 to 2003, he was Associate Professor at the Technical University of Berlin. Since October 2003, Sergei Gorlatch has been Professor of Computer Science at the University of Münster. His current research areas include parallel algorithms, programming methodology and formal methods for parallel and distributed systems, and performance evaluation.
☆ Parts of this paper were presented at PaCT’03 (Nizhni Novgorod) and EuroPar’02 (Paderborn).