A schedule cache for data parallel unstructured computations
Introduction
Data parallel languages, and especially High Performance Fortran (HPF) [11], [12], have been designed to enable efficient, high-level parallel programming for distributed memory machines, in contrast to the low-level concurrent programming model based on explicit message-passing primitives. For distributed memory architectures, HPF compilers emulate the global address space by distributing the arrays among the processors according to the user's mapping directives and by automatically generating explicit interprocessor communication.
Irregular applications are a challenge for data parallel languages. These applications use indirect addressing on distributed arrays, i.e., gather and scatter operations. Because the addressing scheme depends on the values of the indirection array, the dynamic nature of such computations leaves little information that can be computed at compile time. For irregular parallel assignments, complex data structures, called schedules, are required for accessing remote items of distributed arrays and for communication optimization. These structures can be worked out only at run time. The code design that first builds the schedule and then uses it to carry out the actual communication and computation is known as the inspector/executor scheme [19], and is widely used by research [21], [13], [22], [6] and commercial [18] compilers.
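The inspector/executor split can be illustrated with a minimal sketch. It simulates the processors inside a single Python process, assumes a block distribution, and uses 0-based indices; the names `owner`, `inspector`, `executor` and `A_parts` are hypothetical and are not taken from any HPF run-time system.

```python
from collections import defaultdict

def owner(g, block):
    """Owner of global index g under an (assumed) block distribution."""
    return g // block

def inspector(L_local, me, block):
    """Build a schedule: for each remote owner, the global indices of A needed here."""
    schedule = defaultdict(list)
    for g in L_local:
        p = owner(g, block)
        if p != me:
            schedule[p].append(g)
    return dict(schedule)

def executor(A_parts, L_local, schedule, me, block):
    """Fetch remote values as described by the schedule, then compute T(i) = A(L(i))."""
    ghost = {}
    for p, needed in schedule.items():
        for g in needed:  # stands in for one message received from processor p
            ghost[g] = A_parts[p][g - p * block]
    result = []
    for g in L_local:
        if owner(g, block) == me:
            result.append(A_parts[me][g - me * block])
        else:
            result.append(ghost[g])
    return result

# Two simulated processors, each owning a block of four elements of A.
block = 4
A_parts = [[10, 11, 12, 13], [14, 15, 16, 17]]
L0 = [0, 5, 3, 7]                      # indirection values held by processor 0
sched0 = inspector(L0, me=0, block=block)
T0 = executor(A_parts, L0, sched0, me=0, block=block)

# The data of A change, but L0 does not: the same schedule is still valid.
A_parts[1][1] = 99
T0_reused = executor(A_parts, L0, sched0, me=0, block=block)
```

The last two lines show the case of interest here: only the executor runs again, because the schedule depends on the indirection values and the distribution, not on the contents of A.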
The pre-processing associated with the inspector creates a large run-time overhead, because it involves a large amount of computation and all-to-all communication. Hence, the inspector/executor scheme targets applications that exhibit spatial or temporal locality. Spatial locality eliminates communication for non-local data whose current value is already available or that will be accessed more than once. This study focuses on temporal locality. Temporal locality makes it possible to reuse a previously computed schedule when the communication pattern has not changed, even if the data have been modified. In this case, the inspector cost is amortized over multiple schedule reuses.
This study presents a new protocol to handle temporal locality. It has been implemented and evaluated in the framework of the automatic data parallelism translator (ADAPTOR) HPF compilation system [1]. The protocol is called URB, because the run-time system can either Use, Refresh or Build a schedule for each parallel irregular assignment. At the language level, the protocol requires only a single directive, TRACE. This directive attaches an attribute to the indirection arrays, which is consistent with the semantics of other HPF array-related directives (ALIGN, DISTRIBUTE). The main properties of URB are:
- a schedule can be reused for all parallel assignments that share the same structure parameters and the same values of indirection arrays;
- the schedule information is completely handled by the compiler and the run-time system;
- reuse through procedure calls does not require inter-procedural analysis;
- code using the TRACE directive and other code (e.g. library routines) can safely be mixed, at the cost of losing the performance benefits of schedule reuse in routines that are not TRACE-aware.
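The Use/Refresh/Build decision can be pictured with a minimal sketch of a schedule cache. The text above does not spell out the exact meaning of the three actions, so the interpretation here is an assumption: Build when no schedule exists for the assignment, Use when the traced indirection array is unchanged since the schedule was computed, and Refresh when it has been written in between. `TracedArray`, `ScheduleCache` and the version counter are hypothetical names, not ADAPTOR's interface.

```python
import itertools

class TracedArray:
    """Hypothetical stand-in for an indirection array carrying the TRACE attribute."""
    _ids = itertools.count()

    def __init__(self, values):
        self._values = list(values)
        self.uid = next(TracedArray._ids)   # stable identity used as part of the cache key
        self.version = 0                    # incremented on every write

    def __getitem__(self, i):
        return self._values[i]

    def __setitem__(self, i, v):
        self._values[i] = v
        self.version += 1

    def __len__(self):
        return len(self._values)


class ScheduleCache:
    """Caches one schedule per (indirection array, structure parameters) pair."""

    def __init__(self, build):
        self._build = build                 # the inspector, supplied by the caller
        self._entries = {}                  # key -> (version at build time, schedule)

    def get(self, L, structure):
        key = (L.uid, structure)
        entry = self._entries.get(key)
        if entry is not None and entry[0] == L.version:
            return "use", entry[1]          # L untouched: reuse the schedule as is
        action = "build" if entry is None else "refresh"
        schedule = self._build(L, structure)
        self._entries[key] = (L.version, schedule)
        return action, schedule


inspector_calls = []

def toy_inspector(L, structure):
    """Placeholder inspector: records the call and copies the indirection values."""
    inspector_calls.append(structure)
    return [L[i] for i in range(len(L))]

cache = ScheduleCache(toy_inspector)
L = TracedArray([2, 0, 1])
first, _ = cache.get(L, ("block", 3))       # no entry yet
second, _ = cache.get(L, ("block", 3))      # L unchanged
L[0] = 1                                    # a write to the traced array
third, sched = cache.get(L, ("block", 3))   # entry exists but is stale
```

Because the decision is taken from the state carried by the array itself, the same lookup works when the assignment sits inside a called procedure, which is in the spirit of the third property above.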
The rest of the paper is organized as follows: Section 2 describes the realization of indirect addressing of distributed arrays via the inspector/executor scheme, and the opportunities to reuse schedules. Section 3 describes the protocol that is needed for tracking indirection arrays and how it can be realized. Related work is discussed in Section 4. Section 5 describes the implementation of our ideas in the HPF compilation system ADAPTOR. Section 6 presents performance results, and we conclude in Section 7.
Temporal locality
When compiling HPF for distributed memory machines, indirect addressing of distributed arrays requires unstructured communication that must be handled by complex run-time support. The code fragment in Fig. 1 shows unstructured communications with various opportunities for schedule reuse. Consider the first assignment (a). During the inspector phase, each processor checks which of its values of L refer to non-local values of A and computes the owner. The processors exchange this information to
Dynamic tracing
The run-time approach for reusing communication schedules needs a mechanism to verify at run time that the indirection arrays involved and the mask have not been modified. As the mask can be considered as an additional index, only indirection arrays are considered in the following. If an indirection array L is involved in only one schedule, it suffices for such tracking that the descriptor of L includes a dirty flag. Each write to L sets its dirty flag. The schedule can thus be reused if
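A minimal sketch of the single-schedule case follows. The snippet above is cut off before it states the reuse condition, so the rule used here (reuse when the flag is clear, and clear the flag whenever the schedule is rebuilt) is an assumption, as are the names `Descriptor` and `get_schedule`.

```python
class Descriptor:
    """Hypothetical array descriptor holding the values of L and a dirty flag."""

    def __init__(self, values):
        self.values = list(values)
        self.dirty = True            # no schedule has been built yet

    def write(self, i, v):
        self.values[i] = v
        self.dirty = True            # every write marks the array as modified


def get_schedule(desc, slot, build):
    """Return (schedule, rebuilt). `slot` is a one-element list caching the schedule."""
    if desc.dirty or slot[0] is None:
        slot[0] = build(desc.values)
        desc.dirty = False           # assumption: rebuilding the schedule clears the flag
        return slot[0], True
    return slot[0], False


L = Descriptor([3, 1, 2])
slot = [None]
_, rebuilt_first = get_schedule(L, slot, sorted)    # `sorted` is only a toy "inspector"
_, rebuilt_again = get_schedule(L, slot, sorted)
L.write(0, 0)
sched, rebuilt_after_write = get_schedule(L, slot, sorted)
```

With a single shared flag, a second schedule built from the same L would see the flag already cleared by the first rebuild and could reuse stale information, which is presumably why the single-flag scheme is stated only for an array involved in one schedule.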
Related work
Run-time compilation techniques for irregular computations in a distributed memory framework were pioneered very early [14], [19]. The PARTI library provides an extensive set of optimizations related to building and merging schedules, and to reusing schedules for off-processor data [8], [3], with a focus on spatial locality. The CHAOS library [20] improves on PARTI by targeting dynamic irregular applications. PARTI and CHAOS have been evaluated both through direct use by the programmer [15], where
Overview of the system
ADAPTOR is a public domain HPF compilation system developed at GMD for compiling data parallel HPF programs to equivalent message passing programs [1].
By means of a source-to-source transformation, ADAPTOR translates the data parallel program into an equivalent SPMD (single program, multiple data) program that is executed by all processors in their local address space. Each processor allocates only the portion of a distributed array that it owns. Computation partitioning (work
A detailed analysis of a gather
The performance of the URB protocol was studied for a simple gather operation (T(i) = A(L(i))), in order to expose the various components of the inspector/executor scheme. Three schemes of indirect addressing have been considered: identity (L(i)=i), a cyclic shift, and uniform random values. The random and shift distributions stand for two typical degrees of irregularity, heavy and light. Identity is useful to measure pure overhead. It also describes data-dependent locality, where no
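The three indirection patterns can be generated with a short sketch that also counts how many references leave the owning processor. It assumes 0-based indices, a block distribution whose processor count divides the array size, and a free shift amount `k`; these choices and all function names are illustrative and are not the parameters of the measurements reported here.

```python
import random

def identity(n):
    """L(i) = i: every reference is local."""
    return list(range(n))

def cyclic_shift(n, k):
    """L(i) = (i + k) mod n: light irregularity for small k."""
    return [(i + k) % n for i in range(n)]

def uniform_random(n, seed=0):
    """L(i) drawn uniformly from 0..n-1: heavy irregularity."""
    rng = random.Random(seed)
    return [rng.randrange(n) for _ in range(n)]

def remote_fraction(L, nprocs):
    """Fraction of references A(L(i)) whose owner differs from the owner of T(i)."""
    n = len(L)
    block = n // nprocs              # assumes nprocs divides n
    remote = sum(1 for i, g in enumerate(L) if i // block != g // block)
    return remote / n

def gather(A, L):
    """Sequential reference semantics of T(i) = A(L(i))."""
    return [A[g] for g in L]

n, nprocs = 16, 4
frac_identity = remote_fraction(identity(n), nprocs)
frac_shift = remote_fraction(cyclic_shift(n, 1), nprocs)
frac_random = remote_fraction(uniform_random(n), nprocs)
shifted = gather([10, 11, 12, 13], cyclic_shift(4, 1))
```

With a shift of one element, only the last index of each block refers to a value on another processor, whereas identity produces no remote references at all, so any time spent in the inspector for that pattern is pure overhead.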
Conclusions
In this paper, we have presented a strategy for effectively implementing an inspector/executor paradigm in which communication schedules can be reused. It requires no additional data-flow or inter-procedural analysis at compile time. It also reuses communication schedules in situations where even the best compile-time analysis might fail.
Although the method described here requires a new language construct, the TRACE directive, the user does not have to track the multiple execution paths. This should be
Acknowledgements
We acknowledge the CRI and the CNUSC for access to their IBM SP-2 and appreciate the comments and “food for thought” of the reviewers.
References
- Communication optimization for irregular scientific computations on distributed memory architectures, Journal of Parallel and Distributed Computing (1994)
- et al., Performance of hashed cache migration schemes on multicomputers, Journal of Parallel and Distributed Computing (1991)
- et al., Multiple data parallelism with HPF and KeLP, J. Future Generation Computer Systems (1999)
- ADAPTOR, High Performance Fortran Compilation System. WWW Documentation, Institute for Algorithms and Scientific...
- G. Agrawal, J. Saltz, Interprocedural communication optimizations for distributed memory compilation, Language and...
- G. Agrawal, A. Sussman, J. Saltz, Compiler and run-time support for structured and block-structured applications,...
- S. Benkner, H. Zima, Definition of HPF+ Rel.2. Technical Report, HPF+ Consortium,...
- T. Brandes, ADAPTOR Programmer's Guide (Version 5.0). Technical Documentation, GMD, May 1997. Available via anonymous...
- T. Brandes, F. Zimmermann, ADAPTOR- A transformation tool for HPF programs, in: K. Decker, R. Rehmann (Eds.),...
- T. Brandes, F. Zimmermann, C. Borel, M. Brédif, Evaluation of high performance Fortran for an industrial computational...