Early prediction of MPP performance: The SP2, T3D, and Paragon experiences

doi:10.1016/0167-8191(96)00034-8

Parallel Computing

Volume 22, Issue 7, 1 October 1996, Pages 917-942

https://doi.org/10.1016/0167-8191(96)00034-8

Abstract

The performance of Massively Parallel Processors (MPPs) depends on a large number of machine and program factors. Software development for MPP applications is often very costly, and the high cost is partially caused by a lack of early prediction of MPP performance: the program development cycle may iterate many times before achieving the desired performance level. In this paper, we present an early prediction scheme we have developed at the University of Southern California for reducing the cost of application software development. Using workload analysis and overhead estimation, our scheme optimizes the design of parallel algorithms before entering the tedious coding, debugging, and testing cycle of the applications. The scheme is generally applied at the user/programmer level and is not tied to any particular machine platform or any specific software environment. We have tested the effectiveness of this early performance prediction scheme by running the MIT/STAP benchmark programs on a 400-node IBM SP2 system at the Maui High-Performance Computing Center (MHPCC), on a 400-node Intel Paragon system at the San Diego Supercomputing Center (SDSC), and on a 128-node Cray T3D at the Cray Research Eagan Center in Wisconsin. Our predictions prove rather accurate compared with the actual performance measured on these machines. We use the SP2 data to illustrate the early prediction scheme. The main contribution of this work lies in providing a systematic procedure to estimate the computational workload, to determine the application attributes, and to reveal the communication overhead in using these MPPs. These results can be applied to develop MPP applications beyond the STAP benchmarks with which this prediction scheme was developed.
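The abstract names the two ingredients of the scheme, workload analysis and overhead estimation, but gives no formulas. The sketch below is therefore not the authors' model. It is a generic illustration of how such an estimate could be assembled before any parallel code is written, assuming an evenly divided workload and a simple latency-plus-bandwidth cost per message. The `Machine` fields and all function names are hypothetical, and any numbers passed in are placeholders rather than measured SP2, T3D, or Paragon values.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    """Hypothetical per-node machine parameters (assumed, not from the paper)."""
    flops_per_node: float      # sustained floating-point rate per node, flop/s
    latency_s: float           # start-up time per message, seconds
    bandwidth_bytes_s: float   # asymptotic point-to-point bandwidth, bytes/s


def comm_time(machine, n_messages, bytes_per_message):
    """Latency-plus-bandwidth estimate for n_messages of equal size."""
    per_message = machine.latency_s + bytes_per_message / machine.bandwidth_bytes_s
    return n_messages * per_message


def predict_time(machine, workload_flop, nodes, n_messages, bytes_per_message):
    """Estimated run time: evenly divided compute plus per-node communication.

    Assumes perfect load balance and no overlap of compute and communication.
    """
    compute = workload_flop / (nodes * machine.flops_per_node)
    return compute + comm_time(machine, n_messages, bytes_per_message)


def predict_speedup(machine, workload_flop, nodes, n_messages, bytes_per_message):
    """Ratio of the single-node compute time to the predicted parallel time."""
    serial = workload_flop / machine.flops_per_node
    return serial / predict_time(
        machine, workload_flop, nodes, n_messages, bytes_per_message
    )
```

With an estimate of this kind, a designer could compare candidate algorithm decompositions by their predicted time on a given node count and discard the ones dominated by communication before committing to the coding, debugging, and testing cycle.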

