Message scheduling for array re-decomposition on distributed memory systems
Introduction
One of the characteristic features of today's high performance computing systems is physically distributed memory [1]. To execute a data parallel program efficiently on a distributed memory system, appropriate array decomposition is critical. Array decomposition consists of array distribution and array alignment. Array distribution partitions data arrays among processing elements (PEs) according to a specified distribution pattern; the block-cyclic pattern is the most general, assigning blocks of the array to PEs in a round-robin fashion. Array alignment aligns a data array with a distributed template array. The purpose of array decomposition is to enhance data locality and reduce communication overhead.
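The block-cyclic mapping described above can be made concrete with a short sketch (the function names `owner_pe` and `local_index` are illustrative, not from the paper): under a cyclic(b) distribution over P PEs, global element i belongs to PE (i div b) mod P.

```python
# Minimal sketch of block-cyclic(b) index mapping over P PEs.
# Names are hypothetical; the paper does not define this API.

def owner_pe(i: int, b: int, P: int) -> int:
    """PE that owns global element i under cyclic(b) over P PEs."""
    return (i // b) % P

def local_index(i: int, b: int, P: int) -> int:
    """Position of global element i within its owner's local array."""
    return (i // (b * P)) * b + i % b

# With b=2 and P=3, the first 12 elements map to PEs in round-robin blocks:
print([owner_pe(i, 2, 3) for i in range(12)])  # [0,0,1,1,2,2,0,0,1,1,2,2]
```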
Many parallel applications [2], such as computational biology, computational electromagnetics (CEM), multidimensional Fast Fourier Transforms (FFT), the Alternating Direction Implicit (ADI) method, and linear algebra solvers, involve several separate loops. To improve inter-loop data locality, array re-decomposition routines may be required between loops; as noted in [2], [3], these turn out to be critical operations at run-time.
Many data parallel programming languages, such as Chapel [4], Fortran D [5], Vienna Fortran [6], and HPF [7], provide support for array re-decomposition. Generally, array re-decomposition [7] operates at two levels: array realignment and array redistribution. Array realignment (respectively, array redistribution) is the executable counterpart of array alignment (respectively, array distribution): an array declared with the dynamic attribute can be realigned or redistributed at any time. Without communication scheduling, array re-decomposition incurs large amounts of communication idle time, degrading performance due to communication conflicts and differences among message sizes within a communication phase [2].
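A redistribution between two block-cyclic patterns induces, for each pair of PEs, a set of elements that must be communicated. The following sketch (hypothetical names; not the paper's implementation) enumerates these send sets by comparing the source and destination owners of every element:

```python
# Sketch: send sets for redistributing an N-element array from
# cyclic(b_src) to cyclic(b_dst) over P PEs. Illustrative only.
from collections import defaultdict

def owner(i: int, b: int, P: int) -> int:
    return (i // b) % P

def message_table(N: int, b_src: int, b_dst: int, P: int):
    """Map (src, dst) PE pairs to the global indices src must send to dst."""
    msgs = defaultdict(list)
    for i in range(N):
        s, d = owner(i, b_src, P), owner(i, b_dst, P)
        if s != d:                      # element already in place otherwise
            msgs[(s, d)].append(i)
    return dict(msgs)

# Redistribute 12 elements from cyclic(1) to cyclic(2) over 3 PEs.
print(message_table(12, 1, 2, 3))
```

In a real implementation this element-by-element enumeration would be replaced by closed-form index formulas, but the resulting table is the same.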
To reduce these conflicts and size differences, many researchers [2], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23] have studied efficient message scheduling on distributed memory systems. However, most of this work concentrates on special cases and on array redistribution without array realignment. The main contributions of our work lie in three aspects:
- (1)
We focus on scheduling algorithms for array realignment as well as for array redistribution.
- (2)
The new scheduling algorithms minimize the overhead of communication schedule generation by exploiting the information provided by array distribution patterns, array alignment patterns, and the periodic property of array access.
- (3)
We analyze and compare the performance of different array re-decompositions using our scheduling algorithms and other existing algorithms.
Section snippets
Problem description and preliminaries
In this section, we first state the problem we want to solve. Then we present the terminology used in this paper. Finally, we briefly review existing methods for array re-decomposition.
Optimized scheduling algorithms
We first derive the periodic properties of array access from periodic formulas. We then prove the periodic property (Lemma 1) of the COM table, so that the overhead of CS table generation is minimized as much as possible. To obtain the entire COM table from its portion within one period, we give recurrence theorems (Theorem 1) for the COM table elements. Based on these results, we present communication scheduling algorithms that generate the CS table to avoid communication conflicts and reduce the
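The periodicity the snippet refers to can be illustrated with a small sketch (symbols are assumptions on my part, not the paper's Lemma 1): for two block-cyclic patterns cyclic(b_src) and cyclic(b_dst) over P PEs, the pair of owners of element i repeats with period lcm(b_src·P, b_dst·P), so the communication pattern need only be computed within one period.

```python
# Sketch: the (source owner, destination owner) pair of element i is
# periodic with period lcm(b_src*P, b_dst*P). Illustrative values only.
from math import lcm

def owner(i: int, b: int, P: int) -> int:
    return (i // b) % P

def pattern_period(b_src: int, b_dst: int, P: int) -> int:
    return lcm(b_src * P, b_dst * P)

b_src, b_dst, P = 2, 3, 4
L = pattern_period(b_src, b_dst, P)    # lcm(8, 12) = 24
pairs = [(owner(i, b_src, P), owner(i, b_dst, P)) for i in range(2 * L)]
assert pairs[:L] == pairs[L:]          # one period determines the whole pattern
```

This is why a schedule table built from a single period can be extended to the entire array by recurrence rather than recomputation.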
Performance evaluation and experimental results
To evaluate the effect of the proposed algorithms, we compare 4 different algorithms: the Brute-Force algorithm (the MPI_Alltoallv implementation in MPICH2 [24]), the Caterpillar algorithm [21], the Greedy algorithm [22], and our scheduling algorithms. The proposed scheduling algorithms have been incorporated into CC-MPI [20]. The experimental environment is a 1 Gbps Ethernet switched cluster of 36 nodes, each with an Intel Xeon 3.0 GHz processor (1024 KB cache) and 2 GB of memory, running RedHat Linux version FC
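For intuition about what a conflict-avoiding scheduler does, here is a minimal greedy sketch in the spirit of the Greedy algorithm [22] (the exact algorithm in [22] differs; the data and function name here are hypothetical): messages are packed into phases so that each PE sends at most one and receives at most one message per phase, largest messages first to balance phase lengths.

```python
# Sketch: greedy conflict-free phase scheduling for a set of messages,
# each given as (src, dst, size). Illustrative, not the algorithm of [22].

def greedy_schedule(messages):
    """Pack messages into phases with no send or receive conflicts."""
    phases = []
    for msg in sorted(messages, key=lambda m: -m[2]):   # largest first
        src, dst, _ = msg
        for phase in phases:
            # a phase may hold msg only if src and dst are both unused in it
            if all(s != src and d != dst for s, d, _ in phase):
                phase.append(msg)
                break
        else:
            phases.append([msg])                        # open a new phase
    return phases

msgs = [(0, 1, 8), (1, 0, 8), (0, 2, 4), (2, 1, 4), (1, 2, 2)]
for k, phase in enumerate(greedy_schedule(msgs)):
    print(f"phase {k}: {phase}")
```

Fewer phases and more even per-phase message sizes translate directly into less communication idle time.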
Related work
While a number of parallel applications [25], [26], [27] have been developed based on job/task scheduling, we focus on how to schedule messages to improve parallel applications. Recently, many researchers [3], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23] have concentrated on message scheduling for array re-decomposition. These techniques can essentially be classified into two categories according to the type of re-decomposition problem that
Conclusions and future work
Array re-decomposition is a necessary routine in data parallel applications, and how to schedule its messages has received much attention in recent years. We have presented two scheduling algorithms for array re-decomposition under different conditions. Algorithm 1 handles the case in which each source PE sends a message of the same size to a distinct target PE during a communication phase. For other cases, Algorithm 2 schedules the messages
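The uniform case that the snippet attributes to Algorithm 1 admits a classical shift-based schedule, sketched below as an assumption-laden illustration (the paper's Algorithm 1 may differ): when every source PE sends one equal-size message to a distinct target, an all-to-all among P PEs can be completed in P-1 conflict-free steps with the pattern dst = (src + k) mod P.

```python
# Sketch: shift-pattern schedule for the uniform all-to-all case.
# In step k, PE src sends to PE (src + k) % P; every PE sends exactly
# once and receives exactly once per step, so no conflicts arise.

def shift_schedule(P: int):
    return [[(src, (src + k) % P) for src in range(P)]
            for k in range(1, P)]

for k, step in enumerate(shift_schedule(4), start=1):
    print(f"step {k}: {step}")
```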
Acknowledgements
We thank the anonymous reviewers for their insightful comments. The research is partially supported by the Hi-Tech Research and Development Program (863) of China under Grant No. 2006AA01Z105 and No. 2008AA01Z109, the Natural Science Foundation of China under Grant No. 60373008, and by the Key Project of the Chinese Ministry of Education under Grant No. 106019 and No. 108008.
References (27)
- et al.
Improving communication scheduling for array redistribution
J. Parallel Distrib. Comput.
(2005) - et al.
Compiling Fortran 90D/HPF for distributed memory MIMD computers
J. Parallel Distrib. Comput.
(1994) - et al.
Contention-free communication scheduling for array redistribution
Parallel Comput.
(2000) - et al.
Optimization for efficient data redistribution on distributed memory multi-computers
J. Parallel Distrib. Comput.
(1996) - et al.
An MPI prototype for compiled communication on ethernet switched clusters
J. Parallel Distrib. Comput.
(2005) - et al.
Job scheduling and data replication on data grids
Future Gener. Comput. Syst.
(2007) - et al.
Predict task running time in grid environments based on CPU load predictions
Future Gener. Comput. Syst.
(2008) - et al.
COHESION — A microkernel based desktop grid platform for irregular task-parallel applications
Future Gener. Comput. Syst.
(2008) - ...
- et al.
Scheduling messages for data redistribution: An experimental study
Int. J. High Perform. Comput. Appl.
(2006)
VFC: The Vienna Fortran compiler
Sci. Progr.
Cited by (2)
Parallel algorithms for islanded microgrid with photovoltaic and energy storage systems planning optimization problem: Material selection and quantity demand optimization
2017, Computer Physics Communications
Parallel Metropolis coupled Markov chain Monte Carlo for isolation with migration model
2013, Applied Mathematics and Information Sciences
Mr. Jue Wang is a Ph.D. student at the School of Information Engineering at the University of Science and Technology Beijing, China. His research interests include parallel computing and parallel compilation technology.
Dr. Changjun Hu is a professor and Ph.D. supervisor at the School of Information Engineering at the University of Science and Technology Beijing, China. His main research interests include parallel computing, parallel compilation technology, parallel software engineering, network storage system, data engineering and software engineering.
Mr. Jilin Zhang is a Ph.D. student at the School of Information Engineering at the University of Science and Technology Beijing, China. His research interests include parallel computing and parallel algorithm.
Dr. Jianjiang Li is an associate professor at the School of Information Engineering at the University of Science and Technology Beijing, China. His main research interests include parallel computation, parallel compilation and multi-threaded technology.