
Journal of Computational Physics

Volume 297, 15 September 2015, Pages 237-253

Comparing Coarray Fortran (CAF) with MPI for several structured mesh PDE applications

https://doi.org/10.1016/j.jcp.2015.05.020

Abstract

Language-based approaches to parallelism have been incorporated into the Fortran standard. These Fortran extensions go under the name of Coarray Fortran (CAF) and full-featured compilers that support CAF have become available from Cray and Intel; the GNU implementation is expected in 2015. CAF combines elegance of expression with simplicity of implementation to yield an efficient parallel programming language. Elegance of expression results in very compact parallel code. The existence of a standard helps with portability and maintainability. CAF was designed to excel at one-sided communication, and similar functions that support one-sided communication are also available in the recent MPI-3 standard. One-sided communication is expected to be very valuable for structured mesh applications involving partial differential equations, amongst other possible applications. This paper focuses on a comparison of CAF and MPI for a few very useful application areas that arise routinely in the solution of partial differential equations on structured meshes. The three specific areas are Fast Fourier Techniques, Computational Fluid Dynamics, and Multigrid Methods.

For each of those applications areas, we have developed optimized CAF code and optimized MPI code that is based on the one-sided messaging capabilities of MPI-3. Weak scalability studies that compare CAF and MPI-3 are presented on up to 65,536 processors. Both paradigms scale well, showing that they are well-suited for Petascale-class applications. Some of the applications shown (like Fast Fourier Techniques and Computational Fluid Dynamics) require large, coarse-grained messaging. Such applications emphasize high bandwidth. Our other application (Multigrid Methods) uses pointwise smoothers which require a large amount of fine-grained messaging. In such applications, a premium is placed on low latency. Our studies show that both CAF and MPI-3 offer the twin advantages of high bandwidth and low latency for messages of all sizes. Even for large numbers of processors, CAF either draws level with MPI-3 or shows a slight advantage over MPI-3. Both CAF and MPI-3 are shown to provide substantial advantages over MPI-2.

In addition to the weak scalability studies, we also catalogue some of the best-usage strategies that we have found for our successful implementations of one-sided messaging in CAF and MPI-3. We show that CAF code is considerably easier to write and maintain, and that its simpler syntax makes the parallelism easier to understand.

Introduction

A significant fraction of the resources of any parallel supercomputer is devoted to the solution of partial differential equations (PDEs). As a result, it is very interesting to study the performance of novel parallel programming paradigms and compare them to existing ones. In this paper, we compare two of the most popular recent parallel programming paradigms; both are discussed in some detail below. We focus on the performance of several PDE solvers on structured meshes, and we explain why these applications were chosen. Given the availability of supercomputers with hundreds of thousands of processors, it is also interesting to catalogue the performance of our PDE solvers on many thousands of processors.

The Message Passing Interface (MPI) has long been a mainstay of parallel programming (Gropp, Lusk and Skjellum [12]). While one-sided messaging was first presented in MPI-2, it has been greatly modified in the recently introduced MPI-3 standard with an eye to improving its usability (MPI Forum [15]). One-sided messaging has the potential to offer great benefits in the solution of PDE systems because the computation proceeds in a predictable fashion, with large tracts of code devoted to CPU usage followed by a messaging step where a small or large number of messages of varying sizes may be exchanged. For structured mesh PDE applications, those messages are usually of predictable size and are intended to fill various sections of multidimensional arrays. Consequently, the messages can be properly buffered on both sides of the communication and non-blocking get or put operations can be used. As a result, it is expected that one-sided messaging will fulfill an important need for the solution of PDEs on structured meshes. (For the sake of completeness, it is worth pointing out that most vendors' implementations of MPI-3 are based on Argonne National Lab's MPICH. Consequently, most current MPI-3 implementations are based on MPI-2, which prevents them from realizing the full potential of MPI-3. The Cray implementation of MPI-3 is unique because it can be made to emulate truly one-sided messaging by drawing on Cray's unique architecture. Details for doing this are provided in Appendix A. For this reason, all our results with MPI-3 have been obtained on Cray architectures.)

Another recent line of development comes from Coarray Fortran (CAF), which has been incorporated into the current Fortran standard (known informally as Fortran 2008). At the time of writing, production-grade compilers from Cray and Intel that incorporate the standard have become available and GNU will follow in 2015. A video introduction to CAF is available from http://www.nd.edu/dbalsara/Numerical_PDE_Course. CAF provides support for non-blocking one-sided get or put operations. As a result, it is expected to excel for the solution of PDEs on structured meshes. While CAF and MPI both fulfill a common need, they are based on different philosophies which have different consequences for the end user. (Just as MPI-3 is an emergent paradigm for parallel computing, so is CAF. Its performance can also be tweaked by drawing on Cray's unique architecture. Details for doing this are provided in Appendix B. All our CAF results have also been obtained on Cray architectures.)

MPI is a library-based approach. It is, therefore, easily extensible, portable and not tied to any one compiler. However, the user then becomes bound to the static features of the library and dependent on the vendor's (or system administrator's) fine-tuning of the MPI library. CAF, by contrast, is a compiler-based approach, so the compiler sees the communication and can optimize it. For example, it might make several alterations to remote data in cache or memory without sending the data back to its owner until the next synchronization, and it might reorder statements to allow remote data required in more than one statement to be accessed together. The user necessarily becomes dependent on an available compiler for a particular architecture. This is not much of a problem in a modern setting, since most supercomputer architectures have tended to converge to the same set of chips that are connected by similar interconnect technologies. A good CAF implementation can draw on several compiler-based optimizations that are unavailable to a library-based approach. If the hardware supports specialized interconnect technology, as is the case for several offerings from Cray, then the user can get a greater benefit from those technologies too. If GPUs are also available, and if the vendor's compiler supports OpenACC, the CAF user can get the dual benefits of optimized messaging along with optimized GPU usage. We see, therefore, that each parallel programming paradigm offers some advantages. A comparison between CAF and the modern one-sided communication features of MPI-3 would, therefore, be most useful.

CAF and the newer features in MPI-3 also share a common philosophy – they recognize the value of one-sided messaging. The implementer, therefore, has to rethink the parallelization strategy from the ground up if s/he is to benefit from these novel programming paradigms. We find that this forces the CAF and MPI-3 codes to have a very similar structure. The codes used in this comparison, therefore, have identical structure except for the messaging. This ensures that a fair comparison has been made. However, we hope that the rest of the paper also demonstrates to the reader that the CAF implementations are cleaner, very compact and very expressive.

Our first application is based on FFT techniques. FFT methods are representative of spectral and pseudo-spectral methods (Canuto [8], Canuto et al., [9]). In such methods, a spectral transform (usually an FFT) is carried out in each direction of a three-dimensional problem and the PDE is solved in spectral space. Once the solution is obtained in spectral space, an inverse transform can be used to obtain the solution in real space. The FFTs (or other spectral transforms) work most efficiently if all the data for each one-dimensional FFT is available on a single processor. That is, we need to arrange the data so that it is contiguous in the direction in which the transform is being taken. The transforms have to be taken in all three dimensions, which means that the data have to be rearranged at least twice per timestep. The amount of data communicated can be quite large, and every processor gets data from every other processor. This application, therefore, favors messaging paradigms that optimize bandwidth. Such methods are very desirable for turbulence simulations where the solution is desired on a logically rectangular grid. Modern compact schemes (Lele [13]) also have a similar structure where each direction has to be treated with an implicit solver.
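To make the data-rearrangement step concrete, the sketch below shows, in schematic Coarray Fortran, how a slab-decomposed three-dimensional array can be redistributed so that a different direction becomes contiguous on each image before the one-dimensional FFTs are taken. This is only a minimal illustration under assumed names and sizes (nx, ny, nz, slab and pencil are ours, not the paper's), and it omits the buffering and overlap optimizations that a production FFT code would use.

    program fft_transpose_sketch
      implicit none
      integer, parameter :: nx = 64, ny = 64, nz = 64   ! illustrative global sizes
      real, allocatable  :: slab(:,:,:)[:]    ! y distributed across images, x complete
      real, allocatable  :: pencil(:,:,:)[:]  ! x distributed across images, y complete
      integer :: me, nimg, p, nx_loc, ny_loc

      me     = this_image()
      nimg   = num_images()
      ny_loc = ny / nimg      ! assume the sizes divide evenly among images
      nx_loc = nx / nimg

      allocate(slab  (nx,     ny_loc, nz)[*])
      allocate(pencil(nx_loc, ny,     nz)[*])

      ! ... fill slab with this image's portion of the data and transform in x ...

      sync all                ! every image has finished writing its slab
      do p = 1, nimg
         ! one-sided get: pull this image's x-block out of image p's slab
         pencil(:, (p-1)*ny_loc+1:p*ny_loc, :) = &
              slab((me-1)*nx_loc+1:me*nx_loc, :, :)[p]
      end do
      sync all                ! every image now holds complete y-pencils
      ! ... perform the one-dimensional FFTs along y on the pencil array ...
    end program fft_transpose_sketch

Because every image pulls a block from every other image, the volume of data moved in this step grows with the problem size, which is why the FFT application rewards high-bandwidth messaging.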

Our second application involves a magnetohydrodynamics (MHD) code (Balsara [1], [2], [3]). The application is prototypical of a large class of higher-order Godunov schemes that are very popular for solving computational fluid dynamics (CFD) problems. In this class of applications, the simulated data are present on logically Cartesian patches and the data for each patch are localized on a single processor. The goal of the higher-order Godunov solver is to step these data forward in time using a sequence of timesteps. The update of a zone usually requires a halo of zones around it. A second-order scheme requires a halo of one or two zones, depending on how the time-update is structured. A third-order scheme requires a halo of two or three zones, depending on how the time-update is structured. Thus, one has to make only a few small halo exchanges for a given patch of data. These halo exchanges are easy to buffer. Godunov schemes do several very detailed computations per zone and per timestep. It is not unusual to have several thousand floating-point operations for the update of a single zone over a single timestep in an MHD calculation. (Or one might need several hundred floating-point operations for the update of a single zone over a single timestep in a CFD calculation.) Because of the high on-processor cost of a single timestep, the cost of messaging can be very successfully amortized in applications of this type. As a result, they have excellent scalability. We include them here because we wish to show that both paradigms for parallel programming perform admirably on this application, as expected.

Hyperbolic PDE problems, like the one in the previous paragraph, are seldom solved in isolation. Most physical applications have an elliptic or parabolic PDE solver in addition to the hyperbolic PDE solver. The quintessential elliptic problem is the Poisson problem and the fastest way to solve such a problem in a serial setting is the multigrid method (Brandt [5], [6], Briggs, Henson and McCormick [7], Trottenberg, Oosterlee and Schuller [18]). As a result, the parallelization of this class of application is also interesting, with the result that multigrid methods constitute our third class of application. The methods consist of improving the solution by considering it on a sequence of meshes, each coarser than the last. These meshes are known as levels, so that the problem solution proceeds from finest level to coarsest level and then back to the finest level. On each level, the solution is improved by performing just a few relaxation steps; usually four to eight relaxation steps are used in three dimensions. These relaxation steps are very inexpensive and cost only a few floating-point operations per mesh point (Yavneh [20], [21]). Between each relaxation step, a small number of halo exchanges are needed for each patch of distributed data. The cost of these halo exchanges is, however, very difficult to amortize given the small number of floating-point operations. To transfer the solution from a fine level to a coarser level requires a restriction step. Again, there are very few floating-point operations in a restriction step. Likewise, to transfer the solution from a coarse level to a finer level requires a prolongation step, which again has a very small number of floating-point operations. We see that just like the halo exchanges at a given level, the messaging in the restriction and prolongation steps across levels is very difficult to amortize. The amount of data communicated in each message can be quite small, but many such messages have to be exchanged by each processor. Our multigrid application is the diametric opposite of our FFT application. Our third application, therefore, favors messaging paradigms that minimize latency. The ideal messaging paradigm is one that performs admirably across these competing performance requirements. By including a spectrum of applications in the same paper, we are in a position to ascertain the strengths and weaknesses of CAF and MPI-3.
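The serial kernels below illustrate why multigrid messaging is so hard to amortize: a weighted-Jacobi sweep for the three-dimensional Poisson problem and a simple cell-averaging restriction each cost only a handful of floating-point operations per mesh point. These routines are our own generic sketches (the relaxation schemes actually used in the paper follow Yavneh [20], [21]), and the array names and bounds are illustrative.

    subroutine jacobi_sweep(u, unew, rhs, n, h, omega)
      ! One weighted-Jacobi relaxation sweep for the 3D Poisson problem,
      ! del^2 u = rhs, on an n^3 patch whose one-zone halo has already been filled.
      implicit none
      integer, intent(in) :: n
      real, intent(in)    :: u(0:n+1,0:n+1,0:n+1), rhs(n,n,n), h, omega
      real, intent(inout) :: unew(0:n+1,0:n+1,0:n+1)
      integer :: i, j, k
      do k = 1, n
         do j = 1, n
            do i = 1, n
               unew(i,j,k) = (1.0 - omega)*u(i,j,k) + (omega/6.0) * &
                    ( u(i-1,j,k) + u(i+1,j,k) + u(i,j-1,k) + u(i,j+1,k) &
                    + u(i,j,k-1) + u(i,j,k+1) - h*h*rhs(i,j,k) )
            end do
         end do
      end do
    end subroutine jacobi_sweep

    subroutine restrict_to_coarse(fine, coarse, nc)
      ! Cell-averaging restriction from a (2*nc)^3 fine patch to an nc^3 coarse patch.
      implicit none
      integer, intent(in) :: nc
      real, intent(in)    :: fine(2*nc,2*nc,2*nc)
      real, intent(out)   :: coarse(nc,nc,nc)
      integer :: i, j, k
      do k = 1, nc
         do j = 1, nc
            do i = 1, nc
               coarse(i,j,k) = 0.125 * sum( fine(2*i-1:2*i, 2*j-1:2*j, 2*k-1:2*k) )
            end do
         end do
      end do
    end subroutine restrict_to_coarse

With roughly ten flops per point per sweep, a halo exchange must be inserted between sweeps and between levels, so the messaging cost is dominated by latency rather than bandwidth.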

This paper has three goals. First, we wish to compare the weak scalability of CAF and MPI-3 for all of our target applications on rather large numbers of processors. Second, we wish to compare the newer MPI-3 with the older MPI-2 usage. Third, we wish to catalogue some of the best-usage strategies that we have found for CAF and MPI-3 for our target applications.

For the sake of completeness, it is worth mentioning that an early scalability study of multigrid methods operating under CAF and MPI was presented by Numrich, Reid and Kim [16]. That study was restricted to multigrid methods on up to 64 processors using a compiler with coarray features as an extension of Fortran 95. By contrast, our study includes several applications that stress parallel programming paradigms in different ways. We also carry out our study on up to tens of thousands of processors using standard-conforming compilers. Most importantly, Numrich, Reid and Kim [16] only had access to an early version of MPI, whereas here we have an opportunity to compare the one-sided, non-blocking messaging features in CAF and MPI-3 with the older blocking messaging from MPI-2.

It is also very useful to mention other recent innovations in this field. Our work has used traditional multigrid methods where synchronization is done as needed. However, Bethune et al. [4] have studied the value of asynchronous Jacobi iterations where remote halo data may be obtained from any previous iteration. This reduces the need for frequent synchronization, and has potential value for Exascale applications. Messaging transactions for interprocessor communication can also be dynamically aborted, as shown by Gramoli and Harmanci [11]. This again enables another level of performance that might be needed for Exascale computation. Yang et al. [19] and Fanfarillo et al. [10] have also used MPI-3 as a communication substrate for CAF. A GASNet-based communication substrate has also been explored by the same authors. The work of Fanfarillo et al. [10] is especially useful for applications-oriented computational scientists and engineers because they build an MPI-3-based communication substrate directly into the public-domain CAF compiler from GNU. This gives the end-user the ease of expression that CAF provides in conjunction with the portability and generality of MPI-3. Compiler-based optimizations have also been discussed by Fanfarillo et al. [10] and Yang et al. [19]. Fanfarillo et al. [10] also provide an in-depth study of communication bandwidth and latency of CAF versus MPI-3 using a micro-benchmark test suite. Our present study does not use a micro-benchmark test suite; instead, we use complete scientific applications. While micro-benchmark test suites give greater insight to computer scientists, comprehensive studies like this one are, in our humble opinion, more useful and reassuring to computational scientists and engineers.

The plan of the paper is as follows. Section 2 provides a brief overview of CAF messaging. (We do not provide an analogous section for MPI-3 because the MPI Forum [15] has provided a detailed introduction to MPI-3.) Section 3 considers the fundamental communication problems when solving PDEs. Section 4 provides weak scalability studies using CAF and MPI for FFT, MHD and multigrid applications. Section 5 presents discussions and conclusions.


A very brief introduction to coarrays

As with MPI, a coarray Fortran program executes as if it were replicated a fixed number of times. The programmer may retrieve this number at run time through the intrinsic procedure num_images(). Each copy is called an image and executes asynchronously usually, but not necessarily, on a separate physical processor. Each image has its own index, which the programmer may retrieve at run time through the intrinsic procedure this_image(). Consequently, an image can find its image number and store
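As a minimal illustration of these intrinsics (our own sketch, not code taken from the paper), the program below declares a scalar coarray, has every image record its own image number, and then lets image 1 read every other image's copy with a one-sided get.

    program hello_caf
      implicit none
      integer :: my_id[*]     ! a scalar coarray: one copy of my_id exists on every image
      integer :: p

      my_id = this_image()    ! each image writes only its own copy
      sync all                ! make those writes visible before any remote reads

      if (this_image() == 1) then
         do p = 1, num_images()
            print *, 'image', p, 'stored', my_id[p]   ! one-sided get from image p
         end do
      end if
    end program hello_caf

The square-bracket codimension is the only syntactic addition: a reference without brackets is purely local, while a reference such as my_id[p] reaches into image p's memory.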

Halo exchange

Many methods for solving partial differential equations divide the computational domain into patches and hold the representation of each patch on a single processor. This is reasonable for applications that involve halo operations, and our MHD (or any flow solver) is like that. The patches have boundaries with neighboring patches. Active zones that abut these boundaries form a small halo of zones at the boundaries of each patch. Some of the values in this halo of zones have to be obtained from
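A schematic of such an exchange in a single direction, written with one-sided coarray gets, is shown below. This is a hedged sketch under our own assumptions: the array name, the one-zone-wide halo, the one-dimensional decomposition and the periodic neighbors are all illustrative, whereas the paper's MHD and multigrid codes exchange wider halos over a three-dimensional decomposition.

    subroutine exchange_halo_x(u, nx, ny, nz)
      ! Fill the one-zone halo on each x-face of a patch by pulling the
      ! neighboring image's outermost active layer with a one-sided get.
      implicit none
      integer, intent(in) :: nx, ny, nz
      real, intent(inout) :: u(0:nx+1, ny, nz)[*]
      integer :: me, left, right

      me    = this_image()
      left  = me - 1; if (left  < 1)            left  = num_images()   ! periodic wrap
      right = me + 1; if (right > num_images()) right = 1

      sync all                              ! neighbors have finished their updates
      u(0,    :, :) = u(nx, :, :)[left]     ! get left neighbor's last active layer
      u(nx+1, :, :) = u(1,  :, :)[right]    ! get right neighbor's first active layer
      sync all                              ! halo data are now safe to use
    end subroutine exchange_halo_x

In practice, synchronizing only with the two neighboring images via sync images is cheaper than the global sync all used here for brevity.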

Weak scalability comparisons between CAF and MPI-3

In this section we present weak scalability studies for CAF and MPI-3. The codes were nearly identical, the only difference being that one-sided messaging was used for CAF (see Section 3.1) whereas one-sided MPI_GET routines were used for MPI-3 (see Appendix A for an example). Scalability studies are shown for FFT techniques, multigrid methods and an MHD code.
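For comparison with the coarray sketches above, a minimal fence-based MPI-3 one-sided get in Fortran is sketched below. It is not the code of Appendix A; it merely shows the window creation, the access epoch, and an MPI_Get that pulls one value from a neighboring rank, with names and sizes of our own choosing. The production runs reported in the paper additionally rely on Cray-specific settings, described in Appendix A, to obtain truly one-sided transfers.

    program mpi3_get_sketch
      use mpi_f08
      implicit none
      integer, parameter :: n = 16
      double precision   :: local(n), halo
      integer            :: me, nproc, right
      integer(kind=MPI_ADDRESS_KIND) :: winsize, disp
      type(MPI_Win)      :: win

      call MPI_Init()
      call MPI_Comm_rank(MPI_COMM_WORLD, me)
      call MPI_Comm_size(MPI_COMM_WORLD, nproc)
      right = mod(me + 1, nproc)              ! periodic right-hand neighbor

      local   = dble(me)                      ! fill the buffer to be exposed
      winsize = int(n, MPI_ADDRESS_KIND) * 8  ! 8 bytes per double precision value
      call MPI_Win_create(local, winsize, 8, MPI_INFO_NULL, MPI_COMM_WORLD, win)

      call MPI_Win_fence(0, win)              ! open the access epoch
      disp = 0                                ! first element of the neighbor's buffer
      call MPI_Get(halo, 1, MPI_DOUBLE_PRECISION, right, disp, 1, &
                   MPI_DOUBLE_PRECISION, win)
      call MPI_Win_fence(0, win)              ! the get is complete after this fence
      if (me == 0) print *, 'rank 0 pulled', halo, 'from rank', right

      call MPI_Win_free(win)
      call MPI_Finalize()
    end program mpi3_get_sketch

The contrast in verbosity between this window-and-epoch bookkeeping and the single-line coarray get is representative of the ease-of-expression argument made throughout the paper.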

CAF is now available on all Cray machines. We had access to two different types of Cray machines that span two generations of

Conclusions

CAF and MPI-3 represent new paradigms for one-sided messaging that are especially well-adapted to advanced PetaScale and future ExaScale architectures. This style of messaging has the potential of reducing messaging time and enhancing performance on those architectures. CAF is a language-based approach and MPI-3 is a library-based approach; both approaches to parallelism have their unique advantages. CAF has become available via several compiler vendors and the MPI-3 library, with some of the

Acknowledgements

DSB acknowledges support via NSF grants NSF-AST-1009091, NSF-ACI-1307369 and NSF-DMS-1361197. DSB also acknowledges support via NASA grants from the Fermi program as well as NASA-NNX 12A088G. Computer support on NSF's XSEDE and Blue Waters computing resources is also acknowledged. All three authors gratefully acknowledge the cheerful help and insightful advice provided by Bill Long and Pete Mendygral of Cray.
