Message scheduling for array re-decomposition on distributed memory systems
Introduction
One of the characteristic features of today's high performance computing systems is physically distributed memory [1]. To execute a data parallel program efficiently on a distributed memory system, appropriate array decomposition is critical. Array decomposition consists of array distribution and array alignment. Array distribution partitions data arrays among processing elements (PEs) according to a specified distribution pattern; the block-cyclic pattern is the most general, assigning blocks of the array to PEs in a round-robin fashion. Array alignment aligns a data array with a distributed template array. The purpose of array decomposition is to enhance data locality and reduce communication overhead.
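The block-cyclic mapping described above can be made concrete with a short sketch (the function names `owner_pe` and `local_index` are illustrative, not from the paper): under a cyclic(b) distribution over P PEs, global element i belongs to PE (i div b) mod P.

```python
# Minimal sketch of block-cyclic(b) index mapping over P PEs.
# Names are hypothetical; the paper does not define this API.

def owner_pe(i: int, b: int, P: int) -> int:
    """PE that owns global element i under cyclic(b) over P PEs."""
    return (i // b) % P

def local_index(i: int, b: int, P: int) -> int:
    """Position of global element i within its owner's local array."""
    return (i // (b * P)) * b + i % b

# With b=2 and P=3, the first 12 elements map to PEs in round-robin blocks:
print([owner_pe(i, 2, 3) for i in range(12)])  # [0,0,1,1,2,2,0,0,1,1,2,2]
```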
Many parallel applications [2], such as computational biology, computational electromagnetics (CEM), multidimensional Fast Fourier Transforms (FFT), the Alternating Direction Implicit (ADI) method, and linear algebra solvers, involve several separate loops. To improve inter-loop data locality, array re-decomposition routines may be required between loops; as noted in [2], [3], these turn out to be critical operations at run-time.
Many data parallel programming languages, such as Chapel [4], Fortran D [5], Vienna Fortran [6], and HPF [7], provide support for array re-decomposition. Generally, array re-decomposition [7] operates at two levels: array realignment and array redistribution. Array realignment (respectively, array redistribution) is the executable counterpart of array alignment (respectively, array distribution): an array declared with the dynamic attribute can be realigned or redistributed at any time. Without communication scheduling, array re-decomposition incurs large amounts of communication idle time, degrading performance due to communication conflicts and differences among message sizes within a communication phase [2].
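A redistribution between two block-cyclic patterns induces, for each pair of PEs, a set of elements that must be communicated. The following sketch (hypothetical names; not the paper's implementation) enumerates these send sets by comparing the source and destination owners of every element:

```python
# Sketch: send sets for redistributing an N-element array from
# cyclic(b_src) to cyclic(b_dst) over P PEs. Illustrative only.
from collections import defaultdict

def owner(i: int, b: int, P: int) -> int:
    return (i // b) % P

def message_table(N: int, b_src: int, b_dst: int, P: int):
    """Map (src, dst) PE pairs to the global indices src must send to dst."""
    msgs = defaultdict(list)
    for i in range(N):
        s, d = owner(i, b_src, P), owner(i, b_dst, P)
        if s != d:                      # element already in place otherwise
            msgs[(s, d)].append(i)
    return dict(msgs)

# Redistribute 12 elements from cyclic(1) to cyclic(2) over 3 PEs.
print(message_table(12, 1, 2, 3))
```

In a real implementation this element-by-element enumeration would be replaced by closed-form index formulas, but the resulting table is the same.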
To reduce these conflicts and size differences, many researchers [2], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23] have studied efficient message scheduling on distributed memory systems. However, most of this work concentrates on special cases and on array redistribution without array realignment. The main contributions of our work lie in three aspects:
- (1)
We focus on scheduling algorithms for array realignment as well as for array redistribution.
- (2)
The new scheduling algorithms minimize the overhead of communication schedule generation by exploiting the information provided by array distribution patterns, array alignment patterns, and the periodic property of array access.
- (3)
We analyze and compare the performance of different array re-decompositions using our scheduling algorithms and other existing algorithms.
Section snippets
Problem description and preliminaries
In this section, we first state the problem we want to solve. Then we present the terminology used in this paper. Finally, we briefly review existing methods for array re-decomposition.
Optimized scheduling algorithms
We first derive the periodic properties of array access from periodic formulas. We then prove the periodic property (Lemma 1) of the COM table, so that the overhead of CS table generation is minimized as much as possible. To obtain the entire COM table from its portion within one period, we give recurrence theorems (Theorem 1) for the COM table elements. Based on these results, we present communication scheduling algorithms that generate the CS table to avoid communication conflicts and reduce the
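The periodicity the snippet refers to can be illustrated with a small sketch (symbols are assumptions on my part, not the paper's Lemma 1): for two block-cyclic patterns cyclic(b_src) and cyclic(b_dst) over P PEs, the pair of owners of element i repeats with period lcm(b_src·P, b_dst·P), so the communication pattern need only be computed within one period.

```python
# Sketch: the (source owner, destination owner) pair of element i is
# periodic with period lcm(b_src*P, b_dst*P). Illustrative values only.
from math import lcm

def owner(i: int, b: int, P: int) -> int:
    return (i // b) % P

def pattern_period(b_src: int, b_dst: int, P: int) -> int:
    return lcm(b_src * P, b_dst * P)

b_src, b_dst, P = 2, 3, 4
L = pattern_period(b_src, b_dst, P)    # lcm(8, 12) = 24
pairs = [(owner(i, b_src, P), owner(i, b_dst, P)) for i in range(2 * L)]
assert pairs[:L] == pairs[L:]          # one period determines the whole pattern
```

This is why a schedule table built from a single period can be extended to the entire array by recurrence rather than recomputation.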
Performance evaluation and experimental results
To evaluate the effect of the proposed algorithms, we compare 4 different algorithms: the Brute-Force algorithm (the MPI_Alltoallv implementation in MPICH2 [24]), the Caterpillar algorithm [21], the Greedy algorithm [22], and our scheduling algorithms. The proposed scheduling algorithms have been incorporated into CC-MPI [20]. The experimental environment is a 1 Gbps Ethernet switched cluster of 36 nodes, each with an Intel Xeon 3.0 GHz processor (1024 KB cache) and 2 GB of memory, running RedHat Linux version FC
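For intuition about what a conflict-avoiding scheduler does, here is a minimal greedy sketch in the spirit of the Greedy algorithm [22] (the exact algorithm in [22] differs; the data and function name here are hypothetical): messages are packed into phases so that each PE sends at most one and receives at most one message per phase, largest messages first to balance phase lengths.

```python
# Sketch: greedy conflict-free phase scheduling for a set of messages,
# each given as (src, dst, size). Illustrative, not the algorithm of [22].

def greedy_schedule(messages):
    """Pack messages into phases with no send or receive conflicts."""
    phases = []
    for msg in sorted(messages, key=lambda m: -m[2]):   # largest first
        src, dst, _ = msg
        for phase in phases:
            # a phase may hold msg only if src and dst are both unused in it
            if all(s != src and d != dst for s, d, _ in phase):
                phase.append(msg)
                break
        else:
            phases.append([msg])                        # open a new phase
    return phases

msgs = [(0, 1, 8), (1, 0, 8), (0, 2, 4), (2, 1, 4), (1, 2, 2)]
for k, phase in enumerate(greedy_schedule(msgs)):
    print(f"phase {k}: {phase}")
```

Fewer phases and more even per-phase message sizes translate directly into less communication idle time.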
Related work
While a number of parallel applications [25], [26], [27] have been developed based on job/task scheduling, we focus on how to schedule messages to improve parallel applications. Recently, many researchers [3], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23] have concentrated on message scheduling for array re-decomposition. These techniques can essentially be classified into two categories according to the type of re-decomposition problem that
Conclusions and future work
Array re-decomposition is a necessary routine in data parallel applications, and how to schedule its messages has received much attention in recent years. We have presented two scheduling algorithms for array re-decomposition under different conditions. Algorithm 1 handles the case in which each source PE sends a message of the same size to a distinct target PE during a communication phase. For other cases, Algorithm 2 schedules the messages
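The uniform case that the snippet attributes to Algorithm 1 admits a classical shift-based schedule, sketched below as an assumption-laden illustration (the paper's Algorithm 1 may differ): when every source PE sends one equal-size message to a distinct target, an all-to-all among P PEs can be completed in P-1 conflict-free steps with the pattern dst = (src + k) mod P.

```python
# Sketch: shift-pattern schedule for the uniform all-to-all case.
# In step k, PE src sends to PE (src + k) % P; every PE sends exactly
# once and receives exactly once per step, so no conflicts arise.

def shift_schedule(P: int):
    return [[(src, (src + k) % P) for src in range(P)]
            for k in range(1, P)]

for k, step in enumerate(shift_schedule(4), start=1):
    print(f"step {k}: {step}")
```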
Acknowledgements
We thank the anonymous reviewers for their insightful comments. The research is partially supported by the Hi-Tech Research and Development Program (863) of China under Grant No. 2006AA01Z105 and No. 2008AA01Z109, the Natural Science Foundation of China under Grant No. 60373008, and by the Key Project of the Chinese Ministry of Education under Grant No. 106019 and No. 108008.
References (27)
- et al.
Improving communication scheduling for array redistribution
J. Parallel Distrib. Comput.
(2005) - et al.
Compiling Fortran 90D/HPF for distributed memory MIMD computers
J. Parallel Distrib. Comput.
(1994) - et al.
Contention-free communication scheduling for array redistribution
Parallel Comput.
(2000) - et al.
Optimization for efficient data redistribution on distributed memory multi-computers
J. Parallel Distrib. Comput.
(1996) - et al.
An MPI prototype for compiled communication on ethernet switched clusters
J. Parallel Distrib. Comput.
(2005) - et al.
Job scheduling and data replication on data grids
Future Gener. Comput. Syst.
(2007) - et al.
Predict task running time in grid environments based on CPU load predictions
Future Gener. Comput. Syst.
(2008) - et al.
COHESION — A microkernel based desktop grid platform for irregular task-parallel applications
Future Gener. Comput. Syst.
(2008) - ...
- et al.
Scheduling messages for data redistribution: An experimental study
Int. J. High Perform. Comput. Appl.
(2006)
VFC: The Vienna Fortran compiler
Sci. Progr.
Cited by (2)
Parallel algorithms for islanded microgrid with photovoltaic and energy storage systems planning optimization problem: Material selection and quantity demand optimization
2017, Computer Physics Communications
Parallel Metropolis coupled Markov chain Monte Carlo for isolation with migration model
2013, Applied Mathematics and Information Sciences
Mr. Jue Wang is a Ph.D. student at the School of Information Engineering at the University of Science and Technology Beijing, China. His research interests include parallel computing and parallel compilation technology.
Dr. Changjun Hu is a professor and Ph.D. supervisor at the School of Information Engineering at the University of Science and Technology Beijing, China. His main research interests include parallel computing, parallel compilation technology, parallel software engineering, network storage system, data engineering and software engineering.
Mr. Jilin Zhang is a Ph.D. student at the School of Information Engineering at the University of Science and Technology Beijing, China. His research interests include parallel computing and parallel algorithm.
Dr. Jianjiang Li is an associate professor at the School of Information Engineering at the University of Science and Technology Beijing, China. His main research interests include parallel computation, parallel compilation and multi-threaded technology.