A schedule cache for data parallel unstructured computations

doi:10.1016/S0167-8191(00)00056-9

Parallel Computing

Volume 26, Issues 13–14, December 2000, Pages 1807-1823

https://doi.org/10.1016/S0167-8191(00)00056-9

Abstract

High Performance Fortran (HPF) is the de facto standard language for writing data parallel programs. This study describes a dynamic approach for optimizing unstructured communication in codes with indirect addressing. The basic idea is that the run-time data reflecting the communication patterns will be reused if possible. The user has only to specify which data in the program have to be traced for modifications. The experiments and results show the effectiveness of the chosen approach.

Introduction

Data parallel languages, and especially High Performance Fortran (HPF) [11], [12], have been designed to enable efficient, high-level parallel programming for distributed memory machines, in contrast to the low-level concurrent programming model based on explicit message-passing primitives. For distributed memory architectures, HPF compilers emulate the global address space by distributing the arrays among the processors according to the user's mapping directives and by automatically generating explicit interprocessor communication.

Data parallel languages are challenged by irregular applications. These applications use indirect addressing on distributed arrays, i.e., gather and scatter operations. Because the addressing scheme depends on the indirection array, the dynamic nature of such computations leaves little that can be determined at compile time. For irregular parallel assignments, complex data structures, called schedules, are required for accessing remote items of distributed arrays and for communication optimization. These structures can be worked out only at run-time. The code design that first builds the schedule, then uses it to carry out the actual communication and computation, is known as the inspector/executor scheme [19], and is widely used by research [21], [13], [22], [6] and commercial [18] compilers.
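The inspector/executor split can be pictured with a small sequential simulation of a gather T(i) = A(L(i)). This is only a sketch under stated assumptions, not the ADAPTOR run-time: it assumes a block distribution of A, represents the processors' local portions as a list of Python lists, and replaces real messages by direct list accesses. The names owner, inspector, executor and A_parts are hypothetical.

```python
from collections import defaultdict

def owner(g, block):
    """Owning processor of global index g, assuming a block distribution."""
    return g // block

def inspector(L_local, me, block):
    """Build a schedule: for each remote owner, the global indices needed from it."""
    schedule = defaultdict(list)
    for g in L_local:
        p = owner(g, block)
        # Record each non-local index once, even if it is accessed repeatedly.
        if p != me and g not in schedule[p]:
            schedule[p].append(g)
    return dict(schedule)

def executor(A_parts, L_local, me, block, schedule):
    """Fetch remote values as dictated by the schedule, then compute T(i) = A(L(i))."""
    ghost = {}
    for p, indices in schedule.items():
        for g in indices:  # stands in for one message received from processor p
            ghost[g] = A_parts[p][g - p * block]
    T = []
    for g in L_local:
        if owner(g, block) == me:
            T.append(A_parts[me][g - me * block])
        else:
            T.append(ghost[g])
    return T

# Two simulated processors, four elements each (hypothetical data).
A_parts = [[10, 11, 12, 13], [14, 15, 16, 17]]
L_on_p0 = [0, 5, 5, 7]  # processor 0's part of the indirection array

schedule = inspector(L_on_p0, me=0, block=4)
T_first = executor(A_parts, L_on_p0, 0, 4, schedule)

# A changes but L does not: the same schedule remains valid.
A_parts[1][1] = 99
T_second = executor(A_parts, L_on_p0, 0, 4, schedule)
```

In this sketch the schedule depends only on L and on the distribution, so the second call to executor picks up the modified value of A without running the inspector again.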

The pre-processing associated with the inspector creates a large run-time overhead, because it involves a large amount of computation and all-to-all communications. Hence, the inspector/executor scheme targets applications which exhibit spatial or temporal locality. Spatial locality eliminates communication for non-local data whose actual value is already available or that will be accessed more than once. This study focuses on temporal locality. Temporal locality makes it possible to reuse a previously computed schedule when the communication pattern has not changed, even if the data have been modified. In this case, the inspector cost is amortized over multiple schedule reuses.

This study presents a new protocol to handle temporal locality. It has been implemented and evaluated in the framework of the automatic data parallelism translator (ADAPTOR) HPF compilation system [1]. The protocol is called URB, because the run-time system can either Use, Refresh or Build a schedule for each parallel irregular assignment. At the language level, the protocol implementation is limited to a directive, namely TRACE. This directive provides an attribute to the indirection arrays, which is coherent with the semantics of other HPF array-related directives (ALIGN, DISTRIBUTE). The main properties of URB are:

• a schedule can be reused for all parallel assignments that share the same structure parameters and the same values of indirection arrays;
• the schedule information is completely handled by the compiler and the run-time system;
• reuse through procedure calls does not require inter-procedural analysis;
• codes using the TRACE directive and other codes (e.g. library routines) can safely be mixed, although routines that are not TRACE-aware lose the performance benefits of schedule reuse.
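The reuse decision behind these properties can be illustrated with a toy model. The sketch below assumes the simplest case mentioned later under Dynamic tracing, namely one indirection array involved in one schedule and guarded by a dirty flag that every write sets. TracedArray and ScheduleCache are hypothetical names, the build function is a placeholder for a real inspector, and the Refresh case of URB is not modeled here.

```python
class TracedArray:
    """Hypothetical stand-in for an indirection array with the TRACE attribute."""

    def __init__(self, values):
        self._values = list(values)
        self.dirty = True  # no schedule has been built from this array yet

    def __getitem__(self, i):
        return self._values[i]

    def __setitem__(self, i, value):
        self._values[i] = value
        self.dirty = True  # every write marks the array as modified

    def __iter__(self):
        return iter(self._values)


class ScheduleCache:
    """Keeps one schedule and rebuilds it only when the traced array was written."""

    def __init__(self, build):
        self._build = build  # placeholder for the inspector
        self._schedule = None
        self.builds = 0

    def get(self, L):
        if self._schedule is None or L.dirty:
            self._schedule = self._build(L)  # Build
            self.builds += 1
            L.dirty = False
        return self._schedule  # Use


def nonlocal_indices(L):
    """Toy inspector: assume indices 0..3 are local, everything else is remote."""
    return sorted({g for g in L if g >= 4})


L = TracedArray([0, 5, 5, 7])
cache = ScheduleCache(nonlocal_indices)

first = cache.get(L)   # built
second = cache.get(L)  # reused, L untouched
L[0] = 6               # a write to L invalidates the schedule
third = cache.get(L)   # rebuilt
```

Because the flag lives with the array rather than with the call site, the check works unchanged when the array is passed through procedure calls, which is the point of the third property above.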

The main result of the paper is that HPF (and more generally, data-parallel languages) can efficiently handle data-parallel irregular applications, as shown in the performance results section. Most of the work targeting irregular applications has chosen another path, namely parallel libraries based on task parallelism, possibly coordinating data-parallel regular programs, such as in KeLP [16]. Both approaches require hard work for data localization, which is outside the scope of this study. However, when this work is done, a very modest effort of adding an intuitive directive suffices to get good and scalable performance inside the HPF framework.

The rest of the paper is organized as follows: Section 2 describes the realization of indirect addressing of distributed arrays via the inspector/executor scheme, and the opportunities to reuse schedules. Section 3 describes the protocol that is needed for tracking indirection arrays and how it can be realized. Related work is discussed in Section 4. Section 5 describes the implementation of our ideas in the HPF compilation system ADAPTOR. Section 6 presents performance results, and we conclude in Section 7.

Section snippets

Temporal locality

When compiling HPF for distributed memory machines, indirect addressing of distributed arrays requires unstructured communication that must be handled by complex run-time support. The code fragment in Fig. 1 shows unstructured communications with various opportunities for schedule reuse. Consider the first assignment (a). During the inspector phase, each processor checks which of its values of L refer to non-local values of A and computes their owner. The processors exchange this information to

Dynamic tracing

The run-time approach for reusing communication schedules needs a mechanism to verify at run-time that the involved indirection arrays and the mask have not been modified. As the mask can be considered as an additional index, only indirection arrays are considered in the following. If an indirection array L is involved in only one schedule, it suffices for such tracking that the descriptor of L includes a dirty flag. Each write to L sets its dirty flag. The schedule can thus be reused if

Related work

Run-time compilation techniques for irregular computations in a distributed memory framework were pioneered very early [14], [19]. The PARTI library provides an extensive set of optimizations related to building and merging schedules, and reusing schedules for off-processor data [8], [3], focused on spatial locality. The CHAOS library [20] improves on PARTI by targeting dynamic irregular applications. PARTI and CHAOS have been evaluated both through direct use by the programmer [15], where

Overview of the system

ADAPTOR is a public domain HPF compilation system developed at GMD for compiling data parallel HPF programs to equivalent message passing programs [1].

By means of a source-to-source transformation, ADAPTOR translates the data parallel program to an equivalent SPMD program (single program, multiple data) that is executed by all processors in their local address space. Every processor allocates only the portion of a distributed array that it owns. Computation partitioning (work

A detailed analysis of a gather

The performance of the URB protocol was studied for a simple gather operation (T(i) = A(L(i))) in order to expose the various components of the inspector/executor scheme. Three indirect addressing schemes were considered: identity (L(i)=i), a cyclic shift, and uniform random values. The random and shift distributions represent two typical degrees of irregularity, heavy and light respectively. Identity is useful for measuring pure overhead. It also describes data-dependent locality, where no

Conclusions

In this paper, we have presented a strategy for effectively implementing the inspector/executor paradigm so that communication schedules can be reused. It no longer requires data-flow or inter-procedural analysis at compile time. It also reuses communication schedules in situations where even the best compile-time analysis might fail.

Although the method described here requires a new language construct, the TRACE directive, the user does not have to track the multiple execution paths. This should be

Acknowledgements

We acknowledge the CRI and the CNUSC for access to their IBM SP-2 and appreciate the comments and “food for thought” of the reviewers.

References (22)

R. Das, Communication optimization for irregular scientific computations on distributed memory architectures, Journal of Parallel and Distributed Computing (1994)
S. Hiranandani et al., Performance of hashed cache migration schemes on multicomputers, Journal of Parallel and Distributed Computing (1991)
J. Merlin et al., Multiple data parallelism with HPF and KeLP, J. Future Generation Computer Systems (1999)
ADAPTOR, High Performance Fortran Compilation System. WWW Documentation, Institute for Algorithms and Scientific...
G. Agrawal, J. Saltz, Interprocedural communication optimizations for distributed memory compilation, Language and...
G. Agrawal, A. Sussman, J. Saltz, Compiler and run-time support for structured and block-structured applications,...
S. Benkner, H. Zima, Definition of HPF+ Rel.2. Technical Report, HPF+ Consortium,...
T. Brandes, ADAPTOR Programmer's Guide (Version 5.0). Technical Documentation, GMD, May 1997. Available via anonymous...
T. Brandes, F. Zimmermann, ADAPTOR- A transformation tool for HPF programs, in: K. Decker, R. Rehmann (Eds.),...
T. Brandes, F. Zimmermann, C. Borel, M. Brédif, Evaluation of high performance Fortran for an industrial computational...

C. Germain, J. Laminie, M. Pallud, D. Etiemble, An HPF case study of a domain-decomposition based irregular...
