
Parallel Computing

Volume 25, Issues 13–14, December 1999, Pages 2135-2148

Empirical performance modeling for parallel weather prediction codes

https://doi.org/10.1016/S0167-8191(99)00082-4

Abstract

Performance modeling for large industrial or scientific codes is of value for program tuning or for the selection of new machines when benchmarking is not yet possible. We discuss an empirical method of estimating the runtime of certain large parallel programs in which the computational work is estimated by regression functions based on measurements, and the time cost of communication is modeled by program analysis and benchmarks of communication primitives. The method is demonstrated with the local weather model (LM) of the German Weather Service (DWD) on the SP-2, T3E, and SX-4. The method is an economical way of developing performance models because only a moderate number of measurements is required. The resulting model is sufficiently accurate even for very large test cases.

Introduction

When a large industrial or scientific code is developed, the first goal is of course to implement a correct program. Once the program runs correctly, other aspects such as tuning for particular machines become important. Before a program can be improved, the developer has to understand how the program uses the machine. Performance modeling is one way, among others, of better understanding a program run. It is also the only way of investigating the system behavior if the target machine is not yet installed, a situation that often arises when a new machine has to be selected and only a small configuration is available for testing.

We consider an empirical method of estimating the runtime of certain large parallel programs. The proposed method requires a relatively small number of measurements and moderate effort to develop a performance model for a program whose internals are known only to a very limited extent. The modeling itself is based on several heuristic elements, such as assumptions about the program behavior. The method is shown to work well through practical experience rather than by theoretical considerations.

We demonstrate the method with the local weather model (LM) of the German Weather Service (DWD). This code needs hours or even days on the largest available machines for test cases that will be of interest over the next couple of years. Therefore, only a few measurements for examples of small or moderate size could be provided for developing a performance model. Nevertheless, high accuracy is achieved even for large test cases.

The mathematical background of weather prediction models is the solution of partial differential equations. The original equations are discretized, and the solution is obtained on a grid by discrete time stepping. The dimension of the grid (3D or 2D) depends on the approach considered. The basic concept of parallelization is usually grid partitioning with message passing: all activities on a subgrid are gathered in one process. In general, the main loop of such programs consists of a sequence of phases such as dynamics computations, physics computations, FFT, Gaussian elimination, local communication, global communication, input/output of results, and transposition of data within the system of parallel processes. The corresponding activities of the processes run synchronously because of data dependencies and communication phases of various kinds. Therefore, a simple additive model can be used which estimates the overall runtime by summing the runtimes of all phases.
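To make the additive model concrete, the following sketch (in Python, with purely invented phase names, cost functions, and coefficients; the actual LM models are developed in the subsequent sections) shows how per-phase estimates are combined into an overall runtime:

```python
# Minimal sketch of the additive runtime model. All cost functions and
# coefficients below are invented placeholders, not the LM's actual models;
# they only illustrate summing the estimated runtimes of synchronous phases.

def estimate_total_runtime(phase_estimators, params):
    """Sum the estimated runtimes of all synchronously executed phases."""
    return sum(estimate(params) for estimate in phase_estimators)

def dynamics(p):       # hypothetical multilinear form in the local grid size
    return p["steps"] * 1.2e-6 * p["x"] * p["y"] * p["levels"]

def physics(p):        # hypothetical cost per grid point and time step
    return p["steps"] * 0.8e-6 * p["x"] * p["y"] * p["levels"]

def local_exchange(p): # latency + volume/bandwidth per boundary exchange
    halo_bytes = 8 * 2 * (p["x"] + p["y"]) * p["levels"]
    return p["steps"] * 4 * (40e-6 + halo_bytes / 100e6)

total = estimate_total_runtime(
    [dynamics, physics, local_exchange],
    {"x": 64, "y": 64, "levels": 20, "steps": 60},
)
print(f"estimated runtime: {total:.2f} s")
```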

For the LM, we only have to estimate the following phases: dynamics computations, physics computations, 2D local communication, and global communication. The parallel version is a message passing program that uses Message Passing Interface (MPI) primitives for communication. The development of the models for all phases of the program is described explicitly. To demonstrate the reliability of the model, we show the deviation between model predictions and measured runtimes for a variety of test cases. Some of the test cases are used for model development; the others are used only for testing the performance prediction. Communication is concentrated in a few routines, so it can easily be analyzed by reading the program code. Benchmarks for local and global communication primitives complement this skeleton to yield a complete performance model.

The computational phases of the LM depend linearly on the number of grid points in each of the three dimensions and on the number of time steps. Therefore, each parallel run yields a considerable number of data points (i.e. a complete collection of timing values per process) for estimating the parameters of an appropriate multilinear form describing the runtime of the corresponding computational phase. For most phases, load imbalance is caused by the distribution of grid points when the edge lengths of the grid and of the process structure do not match. Physics computations, however, depend strongly on the local weather situation, and the load imbalance caused in this way cannot be analyzed statically. Since the processes run synchronously, the slowest process determines the duration of a phase of physics computations. Therefore, only the maximum of the timing values over all parallel processes of a test run is of interest, and it has to be related to the number of grid points. For economic reasons, only a few such values are available. It turned out empirically that, in this situation, the ratio of the timing values for physics and dynamics computations is more suitable for finding a regression function than the timing values of physics computations themselves.
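As an illustration of such a fit, the sketch below applies ordinary least squares to invented per-process timings; the actual multilinear form used for the LM and its data are described in Section 3. For the physics phase, the same machinery would be applied to the ratio of physics to dynamics timings rather than to the raw physics timings.

```python
import numpy as np

# Invented per-process observations from a few parallel runs: local grid
# sizes (x, y), number of time steps n, and measured phase time t. In the
# real procedure these come from instrumented runs of the LM on the SP-2.
obs = np.array([
    # x,  y,   n,    t [s]
    [24, 24,  60,  11.8],
    [24, 32,  60,  15.6],
    [32, 32,  60,  20.9],
    [32, 32, 120,  41.5],
    [48, 48,  60,  46.7],
])
x, y, n, t = obs.T

# Hypothetical multilinear form: t ~ n * (c0 + c1*x + c2*y + c3*x*y).
A = np.column_stack([n, n * x, n * y, n * x * y])
coeffs, *_ = np.linalg.lstsq(A, t, rcond=None)

def predict(x, y, n):
    return n * (coeffs[0] + coeffs[1] * x + coeffs[2] * y + coeffs[3] * x * y)

print("fitted coefficients:", coeffs)
print("predicted time for a 64x64 local grid, 120 steps:",
      round(predict(64, 64, 120), 1), "s")
```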

The need to spend some extra, time-consuming calculations on some grid points during physics computations, which is caused by the local weather, can lead to two different effects. If a processor has to execute these calculations for all of its grid points as soon as they are required for even one point, the runtime scales down with growing system size in a rather simple way. We always assume this system behavior. If, however, these extra calculations have to be performed for only a few grid points, there is a clear advantage for small systems. Because a small part of the physics computations may be executed in this way, the runtime of small systems is slightly overestimated and the runtime of large systems slightly underestimated.

For the present paper, the reference machine for the development of all performance models of the computational phases is an IBM SP-2. The model was extended to several other target machines such as the Cray T3E, SGI Origin, and NEC SX-4, using only communication benchmarks and a few short calibration runs of the LM on each machine. Here we discuss results for the T3E and SX-4 only. Performance modeling can also predict the performance of future machines that are known by their design only; throughout the present paper, however, we restrict our consideration to existing machines or those which belong to a manufacturer's product line. In [1] some early results were presented, which have since been improved, in particular for vector machines.

The present paper describes a method for developing a performance model for a large weather code on the basis of a few measurements for application cases of moderate size. In addition, we use the model of the LM to discuss issues such as scalability, tuning of the LM on vector machines, and the runtime requirements of the DWD. Section 2 describes the rationale of the proposed method. Section 3 deals with estimating the time cost of computational phases such as dynamics computations or physics computations. We discuss some problems of benchmarking communication routines in Section 4. Results for the LM on the SP-2 reference machine and on other target machines (T3E and SX-4) are discussed in Section 5. We summarize our experiences in some concluding remarks.


Basic conception of modeling

At the beginning of performance modeling for a parallel program, the work has to be decomposed into sequentially running parts whose runtimes can be modeled separately. We concentrate on the main loop of the LM and neglect the initial, intermediate, and final input or output phases. Under these restrictions, we first have to identify computational phases of the program which are separated by calls of data exchange procedures. These data exchange steps contain MPI_Barrier calls
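A per-phase timing instrumentation consistent with this decomposition might look as follows (a sketch using mpi4py; dynamics_step and physics_step are placeholders, not the LM's routines):

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD

def timed_phase(phase_fn, *args):
    """Time one synchronously executed phase on every process.

    The barrier mirrors the synchronization enforced by the data exchange
    steps; the maximum over all processes is what determines the duration
    of the phase in the additive model.
    """
    comm.Barrier()
    t0 = MPI.Wtime()
    phase_fn(*args)
    local = MPI.Wtime() - t0
    return comm.allreduce(local, op=MPI.MAX)

# Hypothetical usage inside the main loop:
# for step in range(n_steps):
#     t_dyn += timed_phase(dynamics_step, state)
#     t_phy += timed_phase(physics_step, state)
```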

Dynamics computations

X−4 is the number of interior grid points in the latitudinal direction and Y−4 in the longitudinal direction. For p = s×r processes arranged in r rows and s columns, we obtain x×y interior grid points per process, where x is given by ⌊(X−4)/s⌋ or ⌈(X−4)/s⌉ and y by ⌊(Y−4)/r⌋ or ⌈(Y−4)/r⌉. All processes have two additional lines of boundary points on each side. For the corresponding processes, these boundary lines belong to the exterior boundary. The number of outer boundary lines of a process is
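A small helper, written for illustration only, makes this distribution of interior grid points explicit:

```python
def local_grid_sizes(X, Y, s, r):
    """Distribute the (X-4) x (Y-4) interior grid points over an s x r
    process grid. Each process gets either the floor or the ceiling of the
    average share, which is the source of the static load imbalance."""
    def split(n_interior, parts):
        lo = n_interior // parts
        hi = -(-n_interior // parts)        # ceiling division
        n_hi = n_interior - lo * parts      # parts that receive the larger share
        return [hi] * n_hi + [lo] * (parts - n_hi)

    return split(X - 4, s), split(Y - 4, r)

xs, ys = local_grid_sizes(X=51, Y=51, s=2, r=2)
print(xs, ys)   # -> [24, 23] [24, 23]
```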

Benchmarking the communication routines

The main problem with benchmarking communication activities is that we are interested in the time difference between the original program and a hypothetical program which does not execute the considered communication but shows the same synchronization. As soon as communication and computation overlap, the latter program is not well defined and cannot be measured. In our case, communication and computation are mostly executed sequentially. Therefore, we can estimate
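Given measured transfer times for the relevant message sizes (e.g. from a simple ping-pong benchmark), a latency-plus-bandwidth model for a point-to-point exchange can be fitted as in the following sketch; the measurement values shown are invented:

```python
import numpy as np

# Invented one-way transfer times (s) for several message sizes (bytes),
# standing in for a real ping-pong benchmark on the target machine.
sizes = np.array([1e3, 1e4, 1e5, 1e6])
times = np.array([5.2e-5, 1.3e-4, 9.5e-4, 9.1e-3])

# Model: t(m) = t0 + m / B  (startup latency t0, asymptotic bandwidth B).
A = np.column_stack([np.ones_like(sizes), sizes])
(t0, inv_B), *_ = np.linalg.lstsq(A, times, rcond=None)

def exchange_cost(message_bytes, n_messages=1):
    """Estimated cost of n sequential, non-overlapped point-to-point messages."""
    return n_messages * (t0 + message_bytes * inv_B)

print(f"latency ~ {t0 * 1e6:.1f} us, bandwidth ~ {1 / inv_B / 1e6:.1f} MB/s")
print(f"one 80 kB halo exchange ~ {exchange_cost(80e3) * 1e3:.2f} ms")
```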

Results

The results for our reference machine (i.e. the SP-2) are presented in Table 3. The deviation of the estimated runtime from the measurements was below 10% for all cases. We ported our model to a T3E running at 300 MHz. The T3E was faster than the SP-2 by a factor of 8.7 for dynamics computations and 6.0 for physics computations. Therefore, the coefficients cdi had to be divided by these factors in order to port the model of the computing part. Case 51×51×20−2×2−60 was used for calibration.
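The porting step itself is mechanical; a sketch of the rescaling, using the calibration factors quoted above (the coefficient values are placeholders):

```python
# Rescale the reference-machine (SP-2) coefficients for a new target machine
# using the speed factors observed in one short calibration run; the factors
# below are those quoted above for the T3E at 300 MHz.
speed_factor = {"dynamics": 8.7, "physics": 6.0}

def port_coefficients(sp2_coeffs, phase):
    """Divide the fitted SP-2 coefficients by the calibration factor."""
    return [c / speed_factor[phase] for c in sp2_coeffs]

# Hypothetical coefficient values, for illustration only:
t3e_dyn_coeffs = port_coefficients([3.1e-4, 1.2e-6, 1.1e-6, 9.0e-8], "dynamics")
```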

Once the model is

Concluding remarks

A method has been presented which allows the development of a performance model for large parallel message passing programs even if the program is not known in detail. A skeleton of the communication structure has to be developed by examining the code. In addition, a moderate number of parallel runs has to be executed on a reference machine in order to obtain the required timing values. These test cases can be of moderate size even if performance prediction for large cases is planned. The

Acknowledgements

The authors would like to thank Prof. G. Hoffmann (DWD) who stimulated this investigation, Dr. U. Schättler (DWD) for his advice and for running the benchmarks on T3E, E. Tschirschnitz (NEC) for some helpful discussions and for running the benchmarks on SX-4, and last but not least Kläre Cassirer and R. Hess (GMD) who implemented and ran most of the benchmarks.

References (3)

  • O. Bröker, K. Cassirer, R. Hess, W. Joppich, H. Mierendorff, Design and performance prediction for parallel global and...
