A cost-optimal parallel implementation of a tridiagonal system solver using skeletons

https://doi.org/10.1016/j.future.2004.05.015

Abstract

We address the task of systematically designing efficient programs for parallel machines. Our approach starts with a sequential algorithm and proceeds by expressing it in terms of standard, pre-implemented parallel components called skeletons. We demonstrate the skeleton-based design process using a tridiagonal system solver as our example application. We develop a cost-optimal parallel version of our application and implement it in the Message Passing Interface (MPI). The performance of our solution is demonstrated experimentally on a Cray T3E machine.

Introduction

The design of parallel algorithms and their implementation on parallel machines is a complex and error-prone process. Traditionally, application programmers take a sequential algorithm and use their experience to find a parallel implementation in an ad hoc manner. A more systematic approach is to use well-defined, reusable parallel components, called skeletons [3]. A skeleton can be viewed as a higher-order function, customizable for a particular application by means of parameters provided by the application programmer. The programmer expresses an application using skeletons as high-level language constructs, whose efficient implementations for particular parallel machines are provided by a compiler or library. The expected performance of a skeleton-based program can be estimated early on, thus facilitating a performance-directed design process.

This paper addresses a practically relevant case study — solving a tridiagonal system (TdS) of linear equations. TdS solvers are notoriously difficult to parallelize: their sparse structure provides relatively little potential parallelism, while communication demand is relatively high (see [9] for an overview and Section 6 for more details).

The paper’s main contribution is, unlike previous ad hoc approaches, the transformation of an intuitive sequential formulation of TdS into a skeleton-based form in a systematic manner. As a result, we obtain an efficient, cost-optimal parallel implementation in the Message Passing Interface (MPI).

We start by introducing basic parallel skeletons (Section 2) and then express our case study – the tridiagonal system solver – using these skeletons (Section 3). We make a design decision between two possible parallel solutions using analytical performance estimates and arrive at a cost-optimal implementation (Section 4). We experimentally study the performance of the developed MPI implementations on a Cray T3E machine (Section 5), and conclude by discussing our results in the context of related work.

Section snippets

Basic data-parallel skeletons

In this section, we introduce basic data-parallel skeletons as higher-order functions defined on non-empty lists; function application is denoted by juxtaposition: f x = f(x):

  • Map: Applying a unary function f to all elements of a list: map f [x1, …, xn] = [f x1, …, f xn]

  • Zip: Element-wise application of a binary operator ⊕ to lists of equal length: zip(⊕)([x1, …, xn], [y1, …, yn]) = [x1 ⊕ y1, …, xn ⊕ yn]

  • Scan-left and scan-right: Computing prefix sums of a list by traversing the list from left to right (or vice versa): scan(⊕)[x1, …, xn] = [x1, x1 ⊕ x2, …, x1 ⊕ x2 ⊕ ⋯ ⊕ xn]
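To make the three basic skeletons concrete, they can be written as ordinary (sequential) higher-order functions; the Python function names and list encoding below are our own illustration, not the paper's notation:

```python
def skel_map(f, xs):
    # map f [x1, ..., xn] = [f x1, ..., f xn]
    return [f(x) for x in xs]

def skel_zip(op, xs, ys):
    # zip(op)([x1, ..., xn], [y1, ..., yn]) = [x1 op y1, ..., xn op yn]
    assert len(xs) == len(ys), "zip requires lists of equal length"
    return [op(x, y) for x, y in zip(xs, ys)]

def scan_left(op, xs):
    # Prefix reductions, traversing left to right:
    # [x1, x1 op x2, ..., x1 op ... op xn]
    acc, out = xs[0], [xs[0]]
    for x in xs[1:]:
        acc = op(acc, x)
        out.append(acc)
    return out

def scan_right(op, xs):
    # Suffix reductions, traversing right to left.
    acc, out = xs[-1], [xs[-1]]
    for x in reversed(xs[:-1]):
        acc = op(x, acc)
        out.insert(0, acc)
    return out
```

For example, `scan_left(lambda a, b: a + b, [1, 2, 3, 4])` yields the prefix sums `[1, 3, 6, 10]`. In the skeleton setting these functions are not executed sequentially as above but mapped onto efficient parallel implementations by a compiler or library.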

Tridiagonal system solver using basic skeletons

We consider the solution of a tridiagonal system (TdS) of linear equations, Ax = b, where A is an n×n coefficient matrix, x the vector of unknowns and b the right-hand-side vector. The only non-zero values of matrix A lie on the main diagonal and immediately above and below it (we call these the upper and lower diagonal, respectively).
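Because only three diagonals are non-zero, A is usually stored as three length-n arrays rather than a dense n×n matrix. As a sketch (the array names and the convention that the unused corner entries are zero are our own, not the paper's), a matrix-vector product in this representation looks like:

```python
def tridiag_matvec(lo, di, up, x):
    """Multiply a tridiagonal n x n matrix by a vector x.

    The matrix is stored as three arrays: di[i] holds the main
    diagonal, up[i] the entry to its right (upper diagonal), and
    lo[i] the entry to its left (lower diagonal); lo[0] and up[n-1]
    are unused and assumed zero.
    """
    n = len(di)
    b = [0.0] * n
    for i in range(n):
        b[i] = di[i] * x[i]
        if i > 0:
            b[i] += lo[i] * x[i - 1]
        if i < n - 1:
            b[i] += up[i] * x[i + 1]
    return b
```

This O(n) storage and work is precisely why dense parallelization strategies do not pay off for TdS: the sparse structure leaves little arithmetic to hide communication behind.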

A typical sequential algorithm for TdS is Gaussian elimination [8], [10], which eliminates the lower and upper diagonals of the matrix (Fig. 1). Both the
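Gaussian elimination specialized to the tridiagonal case (often called the Thomas algorithm) runs in O(n) time: a forward sweep eliminates the lower diagonal, a backward sweep the upper one. The following is a sequential sketch under our own naming, assuming no pivoting is needed (e.g. A diagonally dominant), which the paper does not spell out:

```python
def solve_tridiagonal(lo, di, up, b):
    """Solve A x = b for tridiagonal A by Gaussian elimination
    (Thomas algorithm). lo/di/up hold the lower, main and upper
    diagonals; lo[0] and up[n-1] are unused. Assumes no pivoting
    is required, e.g. A is diagonally dominant.
    """
    n = len(di)
    cp = [0.0] * n  # modified upper diagonal
    dp = [0.0] * n  # modified right-hand side
    # Forward sweep: eliminate the lower diagonal.
    cp[0] = up[0] / di[0]
    dp[0] = b[0] / di[0]
    for i in range(1, n):
        m = di[i] - lo[i] * cp[i - 1]
        cp[i] = up[i] / m if i < n - 1 else 0.0
        dp[i] = (b[i] - lo[i] * dp[i - 1]) / m
    # Backward sweep: eliminate the upper diagonal.
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

Both sweeps carry a loop-carried dependence (each step needs the previous step's result), which is exactly what makes a naive parallelization of this algorithm impossible and motivates re-expressing TdS with more powerful skeletons.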

Towards a cost-optimal implementation

Our approach to parallelizing the function TdS is to express it in terms of skeletons that provide more potential for parallelism than the basic skeletons used so far. Our first candidate is the distributable homomorphism (DH) skeleton, first introduced in [4]:

Definition 1

The DH skeleton is a higher-order function with two parameter operators, ⊕ and ⊗, defined as follows for arbitrary lists x and y of equal length: dh(⊕, ⊗)(x ++ y) = zip(⊕)(dh x, dh y) ++ zip(⊗)(dh x, dh y), where dh abbreviates dh(⊕, ⊗) on the right-hand side.
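Definition 1 can be transcribed directly as a (sequential) recursive function; the Python names below and the restriction to power-of-two lengths are our own illustration:

```python
def dh(oplus, otimes, xs):
    """Distributable homomorphism (Definition 1), evaluated sequentially.

    dh(+,x)(u ++ v) = zip(+)(dh u, dh v) ++ zip(x)(dh u, dh v)
    A singleton list is returned as-is; len(xs) must be a power of two.
    """
    n = len(xs)
    assert n & (n - 1) == 0, "length must be a power of two"
    if n == 1:
        return list(xs)
    u = dh(oplus, otimes, xs[: n // 2])
    v = dh(oplus, otimes, xs[n // 2 :])
    # First half combines element-wise with oplus, second half with otimes.
    return [oplus(a, b) for a, b in zip(u, v)] + \
           [otimes(a, b) for a, b in zip(u, v)]
```

The recursion induces a butterfly communication pattern of depth log n. For instance, instantiating ⊕ = + and ⊗ = − yields the (un-normalized) Walsh–Hadamard transform; FFT is obtainable with suitably chosen operators, as in the paper's reference on a generic MPI implementation of a data-parallel skeleton.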

The DH skeleton is a special form of the

Experimental results

In this section, we briefly report experimental performance results for the parallel version of the tridiagonal system solver developed in this paper. The measurements were carried out on a Cray T3E machine with 24 processors of type Alpha 21164, 300 MHz, 128 MB, using the native MPI implementation.

The two plots in Fig. 2 (left) compare the runtimes of the optimal sequential algorithm with our cost-optimal parallel version depending on the problem size. The cost-optimal solution demonstrates an

Related work and conclusions

The main contribution of this paper is the systematic design of a cost-optimal parallel implementation for a tridiagonal system solver. The important feature of our design is that it is based on well-defined parallel components (skeletons), which can be re-used for different applications. The design process began with an intuitively correct sequential version of the algorithm and proceeded by choosing a suitable parallel skeleton taking into account the expected performance. Furthermore, we

Acknowledgements

We are grateful to Emanuel Kitzelmann who helped a lot in proving Theorem 1, to Martin Alt for many fruitful discussions, to anonymous referees for helpful comments and suggestions, and to Julia Kaiser-Mariani who assisted in improving the presentation.


References (13)

  • H. Bischof, S. Gorlatch, E. Kitzelmann, The double-scan skeleton and its parallelization, Technical Report 2002/06,...
  • H. Bischof et al.

    Cost optimality and predictability of parallel programming with skeletons

  • M.I. Cole. Algorithmic Skeletons: A Structured Approach to the Management of Parallel Computation, Ph.D. thesis,...
  • S. Gorlatch

    Systematic efficient parallelization of scan and other list homomorphisms

  • S. Gorlatch et al.

    A generic MPI implementation for a data-parallel skeleton: formal derivation and application to FFT

    Parallel Process. Lett.

    (1998)
  • R.W. Hockney

    A fast direct solution of Poisson's equation using Fourier analysis

    JACM

    (1965)


Holger Bischof graduated with a diploma in computer science from the University of Passau, Germany in 1999. Subsequently, he worked as a Research Associate at the RWTH Aachen (1999–2000) and at the Technical University of Berlin (2000–2003). Since November 2003, Holger Bischof has been at the University of Münster, Germany, where he is completing his doctoral thesis. His current research areas include the development of parallel programs using skeletons, formal methods for parallel and distributed systems, and performance evaluation.

Sergei Gorlatch received his Master’s degree in Computer Science from Kiev State University in 1979, and his PhD degree from the Glushkov Institute of Cybernetics, Kiev, Ukraine in 1984. From 1991 to 1992, he was a Humboldt Research Fellow at the Technical University of Munich. From 1992 to 1999, Dr. Gorlatch worked as Assistant Professor at the University of Passau, Germany, where he obtained his “Habilitation” (post-doctoral degree) in 1998. From 2000 to 2003, he was Associate Professor at the Technical University of Berlin. Since October 2003, Sergei Gorlatch has been Professor of Computer Science at the University of Münster. His current research areas include parallel algorithms, programming methodology and formal methods for parallel and distributed systems, and performance evaluation.

Parts of this paper were presented at PaCT’03 (Nizhni Novgorod) and EuroPar’02 (Paderborn).
