Parallel Computing
Volume 28, Issue 3, March 2002, Pages 415-432

Practical aspects
Optimizing Local Performance in HPF

https://doi.org/10.1016/S0167-8191(01)00148-X

Abstract

High Performance Fortran (HPF) was created to simplify high-level programming on parallel computers. Its designers strove for an easy-to-use language that would combine portability with efficiency. However, the desired efficiency has so far not been reached. On the contrary, HPF programs are notorious for their poor performance.

This paper provides a rehabilitation of HPF. It is demonstrated how currently available HPF constructs can be utilized to solve sizeable numerical problems efficiently. The method suggested utilizes HPF's EXTRINSIC mechanism to integrate existing single-processor numerical software for computationally expensive kernels into HPF programs.

By using the technique described in this paper, the empirical efficiency, i.e., the ratio of the empirically achieved floating-point performance to the theoretical peak performance, can be raised to 50% or more. Even on message-passing machines with slow communication networks, such as PC clusters (Beowulf clusters) connected by 100 Mbit/s Ethernet, highly satisfactory empirical efficiency is achieved. The performance obtained is even competitive with that of well-established numerical libraries based on MPI.

In contrast to earlier approaches for utilizing existing numerical software in HPF programs, the method presented here uses only HPF features and is therefore portable.

Introduction

High Performance Fortran (HPF [19]) is a programming language which provides high-level support for the development of parallel programs. It has been designed primarily for regular, data-parallel applications. One of the central goals of HPF is to combine high performance with portability across a wide range of (distributed memory) parallel computers.

Since the introduction of HPF 1.0 [18] in 1992 by the High Performance Fortran Forum, two important trends have influenced its further development.

  • On the one hand, compiler producers were reluctant to invest a lot of effort into the development of high quality HPF compilers, because it was unclear how quickly good compilers for Fortran 90/95, which forms the basis of HPF, would become available. Moreover, it was somewhat controversial how successful HPF, and automatic parallelization tools for distributed memory parallel computers in general, could ever become.

  • Users, on the other hand, soon started to demand more advanced features to deal with a broader range of applications than just the regular data-parallel ones. In particular, support for irregular distributions and for task parallelism was investigated (see, for instance, [4], [5], [11], [12], [22], [27]).


As a consequence of the first trend, several central features of HPF 1.0, which are hard to implement efficiently, were taken out of the core language and given the status of “Approved Extensions”. At the same time, according to the second trend, new language constructs, mainly directed towards irregular applications, were developed and incorporated into the Approved Extensions. The result of these activities is the current version of the language definition, HPF 2.0 [19].

HPF is a conceptually very elegant and attractive approach to high-level parallel programming. In particular, it provides very convenient ways for specifying data distribution and for expressing data-parallelism. The development of parallel code using HPF is easier and requires less coding effort than explicitly programming message-passing, for example, using MPI [23]. Nevertheless, HPF has not yet become widely accepted by users for several reasons [16].

  • It took a long time for the language standard HPF 2.0 [19] to be fully supported by commercial compilers. Only recently have mature compilers become available. Unfortunately, several companies have suspended their development of HPF compilers because the market has been assessed as being too small.

  • The few HPF compilers which are available are not able to deliver acceptable performance. In fact, in many cases the performance of HPF codes is so inferior to that of explicit message-passing programs that HPF cannot be considered a competitive alternative, despite its obvious advantages in terms of code development, flexibility, maintenance, portability, and debugging.


In this paper a concept for achieving high performance using HPF is introduced, which opens up new perspectives. The central idea is to utilize existing high quality (sequential) software for local computations.

Previous work in this area has concentrated on interfacing HPF to extrinsic parallel library routines from numerical software packages such as ScaLapack [7]. In this approach the extrinsic parallel library routine performs communication as well as computation, and the main issue is how to pass distribution information efficiently from HPF to the extrinsic parallel routine. In contrast, the focus of our method is on interfacing HPF to sequential library routines. HPF is used as a framework for conveniently distributing data and for organizing high-level parallelism; all local computation is delegated to optimized sequential routines, as sketched below.
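To make this concrete, the following sketch illustrates the pattern for a distributed matrix multiplication C = AB. It is our own illustration under stated assumptions, not a listing of the codes evaluated in Section 4: the wrapper name local_mm is hypothetical, and directive placement may have to be adapted to a particular HPF compiler. The rows of A and C are distributed blockwise, B is replicated, and each processor hands its local block row to the sequential Blas routine DGEMM via HPF's EXTRINSIC(HPF_LOCAL) mechanism.

    ! Global HPF code: data layout and call of the local wrapper (sketch)
    INTEGER, PARAMETER :: n = 2048
    REAL(KIND(1.0D0)), DIMENSION(n,n) :: a, b, c
    !HPF$ DISTRIBUTE a(BLOCK,*)
    !HPF$ DISTRIBUTE c(BLOCK,*)
    !HPF$ DISTRIBUTE b(*,*)          ! b is replicated on every processor

    INTERFACE
      EXTRINSIC(HPF_LOCAL) SUBROUTINE local_mm(a, b, c)
        REAL(KIND(1.0D0)), DIMENSION(:,:) :: a, b, c
    !HPF$ DISTRIBUTE a(BLOCK,*)
    !HPF$ DISTRIBUTE c(BLOCK,*)
    !HPF$ DISTRIBUTE b(*,*)
      END SUBROUTINE local_mm
    END INTERFACE

    ! ... initialize a and b ...
    CALL local_mm(a, b, c)           ! every processor computes its block row of c

    ! Local wrapper (hypothetical): the dummies refer to the local array
    ! sections only, so no communication takes place inside this routine.
    EXTRINSIC(HPF_LOCAL) SUBROUTINE local_mm(a, b, c)
      REAL(KIND(1.0D0)), DIMENSION(:,:) :: a, b, c
      EXTERNAL DGEMM
      CALL DGEMM('N', 'N', SIZE(a,1), SIZE(b,2), SIZE(a,2), 1.0D0, &
                 a, SIZE(a,1), b, SIZE(b,1), 0.0D0, c, SIZE(c,1))
    END SUBROUTINE local_mm

Since the distributions prescribed in the interface match those of the actual arguments, no redistribution occurs at the call site; all floating-point work is performed by the optimized sequential DGEMM.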

Blackford et al. [8] have developed SLHPF, an interface from HPF to ScaLapack. This interface uses three layers of wrapper routines: (i) global HPF routines, which call (ii) HPF_LOCAL routines, which in turn call (iii) Fortran 77 routines containing calls to the Blacs and to ScaLapack. The current version of the SLHPF interface contains HPF wrapper routines for ScaLapack's LU and Cholesky factorizations and for the Level 3 PBlas routines pdgemm and pdtrsm, among others. The wrapper routines do not have a significant influence on the overall execution time (see [8]). Usually most of the total computation time is spent in ScaLapack routines, and the performance of HPF routines which call ScaLapack routines via the SLHPF interface is similar to that of directly called ScaLapack routines. However, if the data distribution of the HPF code is not compatible with ScaLapack, the matrices have to be redistributed at the interface. In this case, performance may deteriorate significantly.

Lorenzo et al. [21] developed another prototypical interface from HPF to ScaLapack for an earlier version of the PGI HPF compiler.

The public domain HPF compilation system Adaptor [9] also contains an interface to ScaLapack. This interface is built directly into Adaptor's runtime system, which makes it possible to bypass the EXTRINSIC mechanism of HPF. In contrast to these approaches, which require a rather complicated special interface to the parallel library routines, the method described in this paper only requires the EXTRINSIC interface, which is part of the HPF standard.

From a more general point of view, there are obviously numerous other approaches in the area of automatic parallelization such as OpenMP [10], [25]. However, it is not our intention to discuss the state-of-the-art or the future potential of automatic parallelization as a whole. Instead, we focus on HPF as a specific example. We show how highly optimized library routines can be made accessible to HPF and illustrate that significant performance improvements are possible by supporting an automatic parallelization tool like HPF with such external building blocks.

In this paper it is shown that HPF programs can meet high performance expectations when combined with existing optimized library routines. It is also shown that HPF programs utilizing sequential library routines are in many cases preferable to HPF programs utilizing parallel libraries. This is not only because the HPF constructs required for calling sequential routines are already supported by current HPF compilers, so that the resulting code is highly portable; it is also because the performance of this approach is less sensitive to the choice of data distribution.

In Section 2 our basic methodology for integrating sequential library routines into HPF programs is introduced. In Section 3 it is shown how this concept can be applied successfully to some very important operations in high performance scientific computing: general and symmetric matrix operations, Cholesky factorization, and two-dimensional FFT. In Section 4 the results of numerical experiments carried out on various parallel computers are summarized.


HPF and numerical libraries

A considerable amount of expertise has been incorporated into high quality software like the Blas [13], [14], [20], into packages which use the Blas as building blocks, like Lapack [2], the standard sequential package for dense or banded matrices, and into parallel packages such as ScaLapack [7] or PLapack [29]. These sequential and parallel libraries can be seen as the result of several decades of combined effort in the field of numerical computing. It seems useful to make these optimized routines accessible from HPF programs as well.

Examples

To demonstrate the usefulness of the new approach, basic linear algebra operations like matrix multiplication as well as higher level algorithms like Cholesky factorization and 2D FFT were implemented. The linear algebra codes utilize Blas routines for local computation. The 2D FFT code calls a sequential FFTpack routine for the local computation of 1D FFTs. As mentioned before, the ultimate goal would be to create a complete library of such HPF routines.
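The two-dimensional FFT follows the same pattern one level higher. The sketch below is our own illustration of the row-column formulation, with a hypothetical HPF_LOCAL wrapper named local_fft_columns around the sequential FFTpack calls; it is not a listing of the measured code. The matrix is distributed by columns, the local columns are transformed with sequential 1D FFTs, the matrix is transposed (the only communication-intensive step, expressed with the Fortran intrinsic TRANSPOSE), and the local 1D FFTs are applied once more.

    INTEGER, PARAMETER :: n = 1024
    COMPLEX(KIND(1.0D0)), DIMENSION(n,n) :: x, xt
    !HPF$ DISTRIBUTE x(*,BLOCK)          ! every processor owns a block of columns
    !HPF$ DISTRIBUTE xt(*,BLOCK)

    INTERFACE
      EXTRINSIC(HPF_LOCAL) SUBROUTINE local_fft_columns(x)
        COMPLEX(KIND(1.0D0)), DIMENSION(:,:) :: x
    !HPF$ DISTRIBUTE x(*,BLOCK)
      END SUBROUTINE local_fft_columns
    END INTERFACE

    CALL local_fft_columns(x)     ! sequential 1D FFTs on the local columns
    xt = TRANSPOSE(x)             ! global transpose: the only communication step
    CALL local_fft_columns(xt)    ! 1D FFTs along the original row dimension;
                                  ! xt now holds the transposed 2D FFT of x

The body of local_fft_columns simply loops over its local columns and calls the sequential FFTpack transform for each of them; since its dummy argument is purely local, the wrapper contains no HPF-specific code beyond the EXTRINSIC prefix.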

Numerical experiments

In this section a short overview of the hardware platforms used in the experiments and a description of the results are given.

Conclusion

In this paper, HPF is shown to support the construction of codes which are portable across important parallel architectures, while achieving a highly satisfactory performance.

It has been demonstrated that subroutines of existing numerical high performance libraries can be integrated into HPF code in a portable and efficient way using HPF language features only. This new approach guarantees high local performance in HPF programs and yields significant performance improvements compared to native HPF code.

Acknowledgements

We want to express our gratitude to John Merlin for many helpful discussions, to Herbert Karner for providing us with knowledge and example codes for Section 3.4 and Table 2, and to the Austrian Science Fund (FWF) for financial support.

References (32)

  • S. Benkner, HPF+: High Performance Fortran for advanced scientific and engineering applications, Future Generation Computer Systems (1999)
  • B.S. Andersen et al., Recursive formulation of some dense linear algebra algorithms
  • E. Anderson et al., Lapack Users' Guide (1999)
  • M. Auer, R. Benedik, F. Franchetti, H. Karner, P. Kristöfel, R. Schachinger, A. Slateff, C.W. Ueberhuber, Performance...
  • S. Benkner, Optimizing irregular HPF applications using halos, Concurrency: Practice and Experience (2000)
  • J. Bilmes et al., Optimizing matrix multiply using PhiPac: a portable, high-performance, ANSI C coding methodology
  • L.S. Blackford et al., ScaLapack Users' Guide (1997)
  • L.S. Blackford, J.J. Dongarra, C.A. Papadopoulos, R.C. Whaley, Installation Guide and Design of the HPF 1.1 Interface...
  • T. Brandes et al., Realization of an HPF interface to ScaLapack with redistributions
  • R. Chandra et al., Parallel Programming in OpenMP (2000)
  • B.M. Chapman et al., OPUS: A coordination language for multidisciplinary applications, Scientific Programming (1997)
  • R. Das et al., Applying the CHAOS/PARTI library to irregular problems in computational chemistry and computational aerodynamics
  • J.J. Dongarra et al., A set of level 3 basic linear algebra subprograms, ACM Transactions on Mathematical Software (1990)
  • J.J. Dongarra et al., An extended set of basic linear algebra subprograms, ACM Transactions on Mathematical Software (1988)
  • H.J. Ehold et al., HPF and numerical libraries
  • H.J. Ehold, W.N. Gansterer, C.W. Ueberhuber, HPF – State of the art, Technical Report AURORA TR1998-01, Vienna...