1 Problem Statement

Neighborhood collective operations were introduced to the MPI standard in version 3.0 [5]. Not only can they simplify the code of, for example, multidimensional stencil computations, but they can also offer a performance benefit over naive handwritten exchange algorithms using MPI_Send and MPI_Recv.

So far, no microbenchmarks are available to assess the performance of MPI neighborhood collectives on virtual topologies. Intel MPI Benchmarks 2017, OSU Micro-Benchmarks 5.3.2 and SKaMPI 5.0.4 [6] do not offer such functionality at all. While NBCBench 1.1 [2] can measure LibNBC’s nonblocking neighborhood Alltoall(v) algorithms, it has not been extended and used to measure the corresponding MPI operations. Furthermore, the neighborhood it uses is built with the deprecated operation MPI_Graph_create, the only parameter available for topology construction is the number of neighbors per process, and the structure of the neighborhood cannot be varied further.

In [9], a microbenchmark was used to compare the durations of a new family of sparse collective operations, which work on isomorphic neighborhoods, to those of the corresponding MPI neighborhood collectives. However, while the MPI operations served as a baseline to explicate performance expectations for the new operations, no expectations for the MPI functions themselves were formulated or assessed there.

In this article, performance expectations for MPI neighborhood collective operations, as well as for the topology creation functions MPI_Cart_create, MPI_Dist_graph_create and MPI_Dist_graph_create_adjacent, are motivated and semiformalized using the concept of self-consistent performance guidelines [7]. A microbenchmark based on the one used in [9] is described in detail; it can semiautomatically assess these guidelines and generate plots of violations, partly using the concepts presented in [4]. The setup and results of first measurements on two different cluster computers, together with the assessment of a subset of the presented guidelines, are shown to illustrate the methodology and to gain first insights into the performance of current MPI libraries.

Section 2 describes performance guidelines for neighborhood collectives and topology creation functions. In Sect. 3 the benchmark is introduced. Section 4 details the experimental setup of the measurements carried out. The results of the experiments are shown and analyzed in Sect. 5. Section 6 concludes the article.

2 Performance Guidelines for Neighborhood Collectives

Self-consistent performance guidelines are a means to express performance expectations for MPI in a semiformal way by relating the durations of different (combinations of) MPI operations which yield the same effect. Since the MPI standard does not impose any performance requirements, the guidelines are argued for on the basis of self-evident user expectations, which are represented by a set of metarules in [7].

A guideline of the form \(a \preceq b\) means that operation a shall not be slower than operation b, given that all common parameters of both operations are equal [7]. Accordingly, \(a \approx b\) means a and b shall perform similarly. The relation is required to hold in the average case over many runs, while isolated counterexamples, possibly due to lazy initialization or disturbing factors during the measurement, are not considered a violation. If \(a \preceq b\) is violated, the user could gain performance by replacing a with b in the violating scenario.

In this section, the following performance guidelines, GL1 to GL12, will be motivated.


GL1 states that if a Cartesian-shaped topology is constructed, the specialized MPI_Cart_create should not be slower than MPI_Dist_graph_create_adjacent, which can construct topologies of arbitrary shape (cf. metarule 3).

A DISTGRAPH topology can be created either by MPI_Dist_graph_create or by MPI_Dist_graph_create_adjacent. While in a call to MPI_Dist_graph_create every process may specify an arbitrary set of edges of the topology graph, MPI_Dist_graph_create_adjacent imposes the precondition that every process passes exactly its incident edges. Because of this additional requirement, MPI_Dist_graph_create_adjacent shall not be slower (GL2, cf. metarule 2).
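To make GL1 and GL2 concrete, the same periodic ring neighborhood can be built with either constructor; the guideline states that the specialized call should not be the slower one. The following sketch is illustrative only; the function name and the 1-D example are not taken from the benchmark.

#include <mpi.h>

/* Build the same 1-D periodic ring topology in two ways (GL1). */
void build_ring_topologies(MPI_Comm comm, MPI_Comm *cart, MPI_Comm *dist)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* Specialized constructor: one periodic dimension. */
    int dims[1] = { size }, periods[1] = { 1 };
    MPI_Cart_create(comm, 1, dims, periods, /* reorder */ 0, cart);

    /* General constructor: every process passes exactly its incident
       edges, which is the precondition exploited by GL2. */
    int nbrs[2] = { (rank - 1 + size) % size, (rank + 1) % size };
    MPI_Dist_graph_create_adjacent(comm, 2, nbrs, MPI_UNWEIGHTED,
                                   2, nbrs, MPI_UNWEIGHTED,
                                   MPI_INFO_NULL, /* reorder */ 0, dist);
}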

GL3 asserts that allowing a topology constructor to change the mapping of rank numbers to actual processes by setting the reorder flag to 1 should not speed up the actual creation, since reordering would be beneficial for subsequent communication operations on the topology and disabling it is, from a performance point of view, only reasonable to save extra cost during communicator creation.

MPI_Neighbor_allgather could be mimicked by MPI_Neighbor_alltoall if the send buffer is copied locally n times, and should therefore not be slower. The same is true for the respective vector variants (GL4). MPI_Neighbor_allgatherv and MPI_Neighbor_alltoallv/-w can mimic their regular counterparts, which should therefore not be slower (GL5, GL6, cf. metarule 3).
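The emulation argument behind GL4 can be sketched as follows; the helper name and the use of MPI_DOUBLE are illustrative, and the communicator is assumed to carry a DISTGRAPH topology.

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Emulate MPI_Neighbor_allgather with MPI_Neighbor_alltoall (GL4) by
   sending the same block to every outgoing neighbor. */
void neighbor_allgather_via_alltoall(const double *sendbuf, int count,
                                     double *recvbuf, MPI_Comm topo_comm)
{
    int indeg, outdeg, weighted;
    MPI_Dist_graph_neighbors_count(topo_comm, &indeg, &outdeg, &weighted);

    /* Copy the single send block once per destination neighbor. */
    double *tmp = malloc((size_t)outdeg * count * sizeof(double));
    for (int i = 0; i < outdeg; i++)
        memcpy(tmp + (size_t)i * count, sendbuf, count * sizeof(double));

    /* Same effect as MPI_Neighbor_allgather(sendbuf, count, MPI_DOUBLE,
       recvbuf, count, MPI_DOUBLE, topo_comm). */
    MPI_Neighbor_alltoall(tmp, count, MPI_DOUBLE,
                          recvbuf, count, MPI_DOUBLE, topo_comm);
    free(tmp);
}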

The neighborhood collectives can be used to simulate the global collectives MPI_Allgather(v) and MPI_Alltoall(v/w), if a fully connected graph topology is created. While neighborhood collectives support any topology, global collectives always follow a complete graph and because of this specialization should not be slower (GL7, 8, cf. metarule 3).
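As a sketch of the full-graph argument behind GL7 and GL8, a complete DISTGRAPH topology can be created as follows; note that, as in the Full neighborhood described in Sect. 3, the process itself is not among its neighbors, so the neighborhood operation moves one block fewer per process than the corresponding global collective.

#include <mpi.h>
#include <stdlib.h>

/* Create a fully connected DISTGRAPH topology: every other process is
   both a source and a destination. On this communicator,
   MPI_Neighbor_alltoall mirrors MPI_Alltoall on the parent communicator. */
MPI_Comm create_full_topology(MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int *nbrs = malloc((size - 1) * sizeof(int));
    for (int i = 0, j = 0; i < size; i++)
        if (i != rank) nbrs[j++] = i;

    MPI_Comm full;
    MPI_Dist_graph_create_adjacent(comm, size - 1, nbrs, MPI_UNWEIGHTED,
                                   size - 1, nbrs, MPI_UNWEIGHTED,
                                   MPI_INFO_NULL, /* reorder */ 0, &full);
    free(nbrs);
    return full;
}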

If a Cartesian-shaped topology is created using one of the distributed graph constructors, neighborhood collectives should not get faster compared to the semantically stricter Cartesian topology created by MPI_Cart_create (GL9, cf. metarule 2). However, their performance for any DISTGRAPH topology should be independent of the constructor, because DISTGRAPH constructors produce semantically equivalent topologies (GL10).

GL11 states that neighborhood collectives should perform similarly on isomorphic topologies, independent of the ordering of the list of ranks passed to the topology constructor to define edges. If this were not respected by an MPI library, the user would be tempted to find the “sweet” ordering herself, possibly breaking performance portability between libraries. If the implementations of the neighborhood collectives of a specific library had such sweet orderings, the topology constructor should reorder its input lists accordingly.

Allowing the ranks to be reordered during communicator creation shall not slow down any neighborhood collective, since the whole point of reordering is to optimize communication performance (GL12).

3 The Benchmark

The microbenchmark used for the experiments comprises a kernel executing the actual measurements and a framework of scripts responsible for control flow, input generation and output analysis. It makes use of findings from [3, 4, 9].

The main goal of the benchmark is to help identify performance problems of MPI implementations in specific environments. Although some decision metrics are defined to enable automatic detection of guideline violations, the quantification of a violation is of less interest than the fact that a violation has been found. While the question of how severe a certain violation is might uncover sensational answers, it will not help much in solving it, except perhaps to set priorities for which violation to tackle first. In fact, investigating its cause, for example by looking into the algorithms and parameters of the MPI implementation, is the step meant to follow the use of the benchmark.

Fig. 1. Results comparing Alltoall to Neigh_alltoall, campaign full-rand-jupiter, \(35\times 16\) processes, Full neighborhood with RAND ordering and \(\texttt {reorder}=0\), outliers removed.

3.1 Kernel

The kernel implements measurement setups for different MPI operations, following the form of Algorithm 1. All input parameters are read from a CSV input file, with each line describing one experiment. Apart from the parameters specific to the topology, explained below, an experiment description includes the communication operation to use, the MPI datatype, the message length, the number of consecutive repetitions \(n_\text {rep}\) and a 64-bit integer w, which is decremented down to 0 in a loop written in inline assembler to accurately simulate local computation of a specific amount of CPU time when measuring the overlap of nonblocking operations.
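A portable approximation of such a countdown loop is sketched below; the paper's kernel writes the loop itself in assembler, whereas this sketch only uses an empty GCC-style asm statement to keep the compiler from removing the loop.

#include <stdint.h>

/* Simulate local computation by decrementing w down to 0. The empty asm
   statement prevents the compiler from optimizing the loop away. */
static void simulate_work(uint64_t w)
{
    while (w-- > 0)
        __asm__ __volatile__("" ::: "memory");
}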

The synchronization is implemented as a handwritten dissemination barrier, as in [9], to improve comparability between different MPI implementations, which might use different algorithms for their MPI_Barrier. MPI_Wtime is used to retrieve wall-clock time with high resolution. The maximum duration over all parallel processes is output as the result time \(\varDelta t[i]\) of each single repetition i.

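A minimal sketch of the measurement loop as described above; the real kernel reads its parameters from the CSV input and uses the handwritten dissemination barrier, for which MPI_Barrier stands in here.

#include <mpi.h>

/* Measure n_rep repetitions of one operation; the result time of
   repetition i is the maximum duration over all processes. */
void measure(MPI_Comm topo_comm, void *sbuf, void *rbuf, int count,
             int n_rep, double *dt /* n_rep entries, valid on rank 0 */)
{
    for (int i = 0; i < n_rep; i++) {
        MPI_Barrier(topo_comm);   /* stand-in for the dissemination barrier */
        double t0 = MPI_Wtime();
        MPI_Neighbor_alltoall(sbuf, count, MPI_BYTE,
                              rbuf, count, MPI_BYTE, topo_comm);
        double local = MPI_Wtime() - t0;
        MPI_Reduce(&local, &dt[i], 1, MPI_DOUBLE, MPI_MAX, 0, topo_comm);
    }
}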

Currently, four different types of neighborhoods are supported, three of which can be described by a mask of relative coordinates, identical for every process, together with a mapping of process ranks to a virtual Cartesian grid of variable dimensionality: the Cartesian neighborhood just uses the MPI_Cart_create topology constructor, giving the simplest form of a multidimensional isomorphic neighborhood by only including the two immediate neighbors along each dimension of the Cartesian grid. The Moore neighborhood of radius r, on the other hand, includes all ranks in the grid within a hypercube with edges of length \(2r+1\) around the process. The von Neumann neighborhood of radius r is a subset of the Moore neighborhood, including only those ranks with relative coordinates c at a Manhattan distance \(\le r\) from the process, i.e. \(\sum _{i=1}^{n_\text {dim}}|c_i| \le r\).
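A sketch of how such a mask could be enumerated; the function and variable names are illustrative, not taken from the benchmark, and at most 16 dimensions are assumed.

#include <stdlib.h>

/* Enumerate the relative coordinates of a Moore neighborhood of radius r
   in ndim dimensions; if von_neumann is set, keep only offsets with
   Manhattan distance <= r. Returns the neighbor count; *mask_out holds
   ndim integers per neighbor. */
int make_mask(int ndim, int r, int von_neumann, int **mask_out)
{
    int side = 2 * r + 1, total = 1;
    for (int d = 0; d < ndim; d++) total *= side;   /* (2r+1)^ndim grid points */

    int *mask = malloc((size_t)total * ndim * sizeof(int));
    int count = 0;
    for (int idx = 0; idx < total; idx++) {
        int tmp = idx, dist = 0, nonzero = 0, c[16]; /* ndim <= 16 assumed */
        for (int d = 0; d < ndim; d++) {
            c[d] = tmp % side - r;                   /* decode idx into an offset */
            tmp /= side;
            dist += abs(c[d]);
            nonzero |= (c[d] != 0);
        }
        if (!nonzero) continue;                      /* skip the process itself */
        if (von_neumann && dist > r) continue;       /* Manhattan distance filter */
        for (int d = 0; d < ndim; d++)
            mask[count * ndim + d] = c[d];
        count++;
    }
    *mask_out = mask;
    return count;
}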

The communicators for Moore and von Neumann neighborhoods are constructed using MPI_Dist_graph_create or MPI_Dist_graph_create_adjacent. The relative coordinates are computed and then translated to lists of source and destination ranks using an intermediate Cartesian communicator and MPI_Cart_rank. The list of source ranks is coordinatewise inverse to the list of destination ranks. MPI_Dist_graph_create_adjacent constructs the topology from both lists. If MPI_Dist_graph_create is used, each process only passes its destination list, leaving the ordering of any internal processwise source list within the communicator to the MPI implementation.
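The translation step might look like the following sketch, assuming the mask produced above, fully periodic dimensions (the benchmark's finite dimensions would need extra boundary handling) and at most 16 dimensions; since the intermediate Cartesian communicator is created with reorder set to 0, its ranks coincide with those of the parent communicator.

#include <mpi.h>
#include <stdlib.h>

/* Translate relative coordinates into neighbor ranks via an intermediate
   Cartesian communicator and build the DISTGRAPH topology. */
MPI_Comm build_neighborhood(MPI_Comm comm, int ndim, const int *dims,
                            const int *mask, int count, int reorder)
{
    int periods[16], coords[16], nbr[16];
    for (int d = 0; d < ndim; d++) periods[d] = 1;   /* all dimensions toroidal */

    MPI_Comm cart;
    MPI_Cart_create(comm, ndim, dims, periods, 0, &cart);

    int rank;
    MPI_Comm_rank(cart, &rank);
    MPI_Cart_coords(cart, rank, ndim, coords);

    /* Destinations from the mask; sources are the coordinatewise inverse. */
    int *dst = malloc((size_t)count * sizeof(int));
    int *src = malloc((size_t)count * sizeof(int));
    for (int i = 0; i < count; i++) {
        for (int d = 0; d < ndim; d++) nbr[d] = coords[d] + mask[i * ndim + d];
        MPI_Cart_rank(cart, nbr, &dst[i]);
        for (int d = 0; d < ndim; d++) nbr[d] = coords[d] - mask[i * ndim + d];
        MPI_Cart_rank(cart, nbr, &src[i]);
    }

    MPI_Comm topo;
    MPI_Dist_graph_create_adjacent(cart, count, src, MPI_UNWEIGHTED,
                                   count, dst, MPI_UNWEIGHTED,
                                   MPI_INFO_NULL, reorder, &topo);
    MPI_Comm_free(&cart);
    free(src); free(dst);
    return topo;
}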

Input parameters for Cartesian, Moore and von Neumann neighborhoods are the number of dimensions \(n_\text {dim}\), the number of finite dimensions \(n_\text {fin}\), i.e. how many of the dimensions are nontoroidal, and the reorder flag of the MPI topology constructor specifying whether the MPI library is allowed to change the mapping of rank numbers to processes to better fit the actual network topology and accelerate communication. This flag does not affect the intermediate Cartesian communicator used to construct Moore and von Neumann neighborhoods, whose reorder flag is always set to 0.

Further parameters for Moore and von Neumann neighborhoods are the radius r and the ordering of the list of relative coordinates before their translation to rank numbers. Possible orderings are first- and last-coordinate-major (FMAJ, LMAJ) and randomized (RAND). FMAJ (LMAJ) means the coordinates are sorted in ascending order with the first (last) coordinate varying slowest. RAND means the list is permuted using a different random seed for each process.

The dimensions of the Cartesian grid in case of Cartesian, Moore and von Neumann neighborhoods are calculated using the TUW_Dims_create function implementing the algorithm described in [8], as the result of MPI_Dims_create has proven to break portability between MPI libraries in the past.

The fourth neighborhood type, the Full neighborhood, connects all processes in a complete graph, i.e. every process is a neighbor of all other processes. As with Moore and von Neumann neighborhoods, it can be selected whether MPI_Dist_graph_create or MPI_Dist_graph_create_adjacent is used as the constructor. Further input parameters are the reordering flag and the ordering of the sources and destinations lists. LINEAR ordering means all rank numbers starting from the process itself are enumerated incrementally (sources) and decrementally (destinations) modulo the total number of processes. RAND ordering means both the sources and the destinations array are randomized independently, with a different random seed for each process. If MPI_Dist_graph_create is used, only the destinations array is passed.
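One possible reading of the LINEAR ordering, given as a sketch; whether the enumeration starts at the process's own rank or, as here, at its immediate successor and predecessor is an interpretation, not taken from the benchmark code.

/* LINEAR ordering of the Full neighborhood for a process `rank` out of
   `size` processes: sources counted upwards, destinations counted
   downwards from the process itself, modulo the number of processes. */
void full_linear_lists(int rank, int size, int *sources, int *destinations)
{
    for (int i = 0; i < size - 1; i++) {
        sources[i]      = (rank + i + 1) % size;
        destinations[i] = (rank - i - 1 + size) % size;
    }
}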

3.2 Framework Scripts

Control flow of a measurement campaign is programmed in bash scripts, which offer an immediate way to automate calling programs and managing input and output files. The workflow to run a measurement campaign is semiautomated by encapsulating five user-invoked steps: (1) building the kernel, (2) creating input files and job scripts, (3) submitting the job scripts to the scheduling system, (4) archiving the results, (5) analyzing the results for performance guideline violations and drawing plots of detected violations.

The files to configure a measurement campaign include a campaign configuration file, in which a list of process deployments is specified, e.g. \([2\times 8,4\times 8]\) for 2 and 4 nodes with 8 processes each. The number of distinct calls to the kernel, \(n_\text {run}\), is set there, as well as a maximum run time, after which the scheduler will kill the job. A separate environment configuration is referenced, containing all machine-specific settings such as paths and the syntax of the mpirun command.

The input to the kernel, i.e. the actual experiments carried out, is generated in step (2) by a Python script, which makes it easy to model all kinds of relations between different input parameters, e.g. “for all \(n_\text {dim}\) create experiments with \(n_\text {fin}\in \{0,1,\dots , n_\text {dim}\}\)”. For each run of the kernel, a separate input file is created with a different random permutation of the same set of experiments to mitigate systematic bias by disturbing factors.

Processing results and assessing the guidelines is done in an R script in step (5). Some assessment configuration needs to be set up by the user: all parameters of the campaign must be subdivided into a guideline parameter, a varied parameter and grouping parameters. The guideline parameter contains the levels to be compared within a guideline – if Neigh_allgather and Neigh_alltoall are to be compared, the guideline parameter would be the measurement setup. A list of guidelines of the form \(a \preceq b\), with ab being levels of the selected guideline parameter, must be provided. The varied parameter will be on the x axis of subsequently generated plots and could, for example, be the message size. All remaining parameters, e.g., neighborhood type, \(n_\text {dim}\), \(n_\text {fin}\), ..., are considered grouping parameters, with every combination of their levels implying a unique group. For each group containing at least one violation, plots will be generated. The script must be rerun for every different guideline parameter.

The script will first calculate the median \(m_l^r:= \text {med}(\text {dropOutliers}(\varDelta t_l^r[0],\dots ,\varDelta t_l^r[n_\text {rep}-1]))\) of the \(n_\text {rep}\) single durations of each run r and each combination of parameter levels l after filtering outliers. This results in \(n_\text {run}\) medians \(m_l^0,\dots ,m_l^{n_\text {run}-1}\) for each combination of parameter levels l. Outliers are values outside of \([q_1 - 1.5(q_3-q_1), q_3 + 1.5(q_3-q_1)]\), with quartiles \(q_1,q_3\), as suggested by Tukey [1, Subsect. 3.2.4].

For each guideline \(a \preceq b\) and each unique combination of the grouping parameters and the varied parameter, the \(n_\text {run}\) medians for a and b are selected. The Wilcoxon rank sum test is then carried out to test whether the medians of a are shifted to the right of b [1, 4, Subsect. 7.4.6]. Further, the violation ratio \(v:=\frac{\text {med}(m_a^0,\dots ,m_a^{n_\text {run}-1})}{\text {med}(m_b^0,\dots ,m_b^{n_\text {run}-1})}\) is computed to quantify the difference between a and b. \(a \preceq b\) is considered violated for the selected parameter levels if \(v \ge v_\text {thres}\) and the test returns a p-value \(\le p_\text {thres}\). \(p_\text {thres}, v_\text {thres}\) are set by the user. The threshold for v filters out very small violations that the statistical test would still consider significant.

For each violation, an overview plot of the affected group is created, which shows the medians of the medians of the result times for both parameter levels a and b in absolute numbers on a log scale, with the varied parameter, e.g. the message size, on the x axis. Further, for each violation, a focus plot is generated, which shows the distributions of the raw results within the individual runs as box plots, normalized to the median of the medians of the durations of a. Figures 1a and 2a give examples of overview plots, Figs. 1b and 2b of focus plots.

4 Experimental Setup

Five measurement campaigns on two different cluster computers have been carried out to assess a subset of the formulated guidelines (see Table 2). In the nbhcoll campaigns, neighborhood collective operations have been executed on Cartesian topologies, as well as on von Neumann and Moore neighborhoods of radius 1, which could all be used in real-world applications performing stencil computations [9]. In the full campaigns, a complete graph is used as topology, making the neighborhood collectives behave like their global collective counterparts, which have been measured here as well. Campaigns full-rand-jupiter and full-tuned-jupiter have been set up and executed because of findings from full-jupiter; see Sect. 5 for details.

The von Neumann neighborhood of radius 1 corresponds exactly to a Cartesian topology; the notation \(\texttt {Cart} \preceq \texttt {Vneum}\) used below refers to GL9. GL11 is tested by comparing the neighbor list orderings FMAJ and LMAJ (nbhcoll) or LINEAR (full) to RAND. The term \(\texttt {reorder=1} \preceq \texttt {reorder=0}\) refers to GL12.

Table 2 lists all parameters of the executed experiments together with the parameter levels used in the respective campaigns. For example, in campaign nbhcoll-jupiter, the two operations Neigh_allgather and Neigh_alltoall have each been measured with 15 different message sizes, on three different topologies, with four different numbers of dimensions, two different values for the number of finite dimensions, three different orderings of the list of neighbors in case of von Neumann and Moore neighborhoods (Cartesian topologies do not have an ordering), and both possible values for the reorder flag during communicator creation. This makes a total of 3360 unique combinations of parameter levels, which are experimentally measured \(n_\text {rep}=50\) times in each of \(n_\text {run}=30\) runs. If, for example, the guideline \(\texttt {Neigh\_allgather} \preceq \texttt {Neigh\_alltoall}\) is evaluated, i.e. the measurement setup is chosen as the guideline parameter, the statistical test is executed for the \(\frac{3360}{2} = 1680\) unique combinations of the remaining parameter levels. Since the varied parameter is the message size, results are presented in \(\frac{1680}{15}=112\) groups.
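For reference, the total of 3360 combinations can be reconstructed from the parameter levels as follows, with the ordering applying only to the von Neumann and Moore neighborhoods:

\[ \underbrace{2}_{\text{operations}} \cdot \underbrace{15}_{\text{message sizes}} \cdot \underbrace{4}_{n_\text{dim}} \cdot \underbrace{2}_{n_\text{fin}} \cdot \underbrace{2}_{\texttt{reorder}} \cdot \underbrace{(1+3+3)}_{\text{topology/ordering}} = 3360. \]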

The first system, Jupiter, has 36 nodes with two AMD Opteron 6134 8-core processors at 2.3 GHz and 32 GB memory each, connected via a Mellanox MT4036 InfiniBand QDR crossbar switch. The second system is VSC3 at the Vienna Scientific Cluster, consisting of 2020 nodes with two Intel Xeon E5-2650v2 8-core processors at 2.6 GHz and 64 GB memory each. The nodes are connected by an InfiniBand QDR-80 fat tree architecture. On Jupiter, both nodes and network links involved in the measurements were dedicated to the benchmark. On VSC3, only the nodes were dedicated, while network switches were possibly shared with other jobs. The benchmark has been compiled and run using gcc 4.4.7 and Open MPI 2.0.1 on Jupiter and gcc 5.3.0 and Intel MPI 2017.1 on VSC3.

The dimensions of the virtual Cartesian grid of processes for the different process deployments and number of dimensions in the nbhcoll campaigns are listed in Table 1.

Table 1. Dimensions array returned by TUW_Dims_create for different \(n_\text {dim}\) and \(n_\text {procs}\).
Table 2. Measurement campaigns referenced in this article.

5 Results

Table 3 lists the numbers of violations of different guidelines for the nbhcoll campaigns on Jupiter and VSC3. Each cell contains two rows: first, the total numbers of violations and tests; second, the numbers of groups containing at least one violation and the total number of groups. Within a group, all parameters are equal except the message length (the varied parameter) and the respective guideline parameter. The threshold values for the assessment were set to \(p_\text {thres}=0.001\) and \(v_\text {thres}=1.03\). Different thresholds were tried, but for higher p-values and lower violation ratios, violations were often not clearly visible in the plots.

In the nbhcoll campaigns, the guideline \(\texttt {Neigh\_allgather} \preceq \texttt {Neigh\_alltoall}\) was only violated for the smaller numbers of processes. On Jupiter, violations occurred for \(n_\text {dim}\in \{2,4\}\) and \(n_\text {fin}=0\), on all three neighborhoods, for all orderings of the neighborhood coordinates, with ratios up to 1.049. The two violations on VSC3 occurred with a four-dimensional Moore neighborhood, \(n_\text {fin}=4\), LMAJ ordering, \(\texttt {reorder}\in \{0,1\}\) and a message size of 4 KB. Their exceptionally high ratio of about 13.8 each stems from a peculiar effect observed on VSC3 for different measurements: the relative dispersion of many runs is of the same order of magnitude as the violation ratio, with the quartiles of many runs spanning from the median of medians of the Neigh_allgather times to the median of medians of the Neigh_alltoall times. Usually, dispersion was much lower, as in the figures from Jupiter in this article. Unfortunately, due to time restrictions, this effect could not be investigated further for this article. The measurements should be rerun with a different node allocation to eliminate a possible interdependency between node allocation, virtual topology and communication algorithm. Note that temporary network effects can already be excluded as a cause due to the randomization of experiments.

On Jupiter, the guideline \(\texttt {Cart} \preceq \texttt {Vneum}\) was violated only with four-dimensional neighborhoods, and the violation ratio did not exceed 1.045. On VSC3, most violations happened for \(n_\text {dim}=4\) as well, including the most severe ones with ratios up to 1.222. For \(35\times 16\), violations occurred only with \(n_\text {dim}=4\), for \(20\times 16\) with \(n_\text {dim}\in \{3,4\}\), and for \(10\times 16\) even with \(n_\text {dim}\in \{2,3,4\}\).

The guideline \(\texttt {FMAJ} \preceq \texttt {RAND}\) was violated only by Moore neighborhoods with \(n_\text {dim}\in \{2,3,4\}\) on Jupiter, with a ratio of up to 1.147. \(\texttt {LMAJ} \preceq \texttt {RAND}\) was violated by both Moore and von Neumann neighborhoods, but only for \(n_\text {dim}=4,n_\text {fin}=0\) and \(10\times 16\) processes. Moore neighborhoods yielded a ratio of up to 1.206. On VSC3, violations of both guidelines occurred for all values of the grouping parameters. The biggest ratio observed in all experiments, 141.9, occurred for \(\texttt {FMAJ} \preceq \texttt {RAND}\), Neigh_alltoall, a Moore neighborhood with \(n_\text {dim}=2,n_\text {fin}=2\), independent of reordering and for a message size of 11585 B. This enormous ratio was due to the same effect on VSC3 mentioned above. However, most of the other reported violations did not suffer from this effect.

The only guideline violated in campaign full-jupiter was \(\texttt {LINEAR} \preceq \texttt {RAND}\), and most violations were quite clear. Top ratios increased with the number of processes, from 1.083 (\(10\times 16\)) to 1.828 (\(35\times 16\)). While for \(10\times 16\) processes only the smaller message sizes up to 1448 B were affected, for \(30\times 16\) processes violations occurred across the whole spectrum of message sizes used.

In the full-jupiter campaign, to save time, the global collectives were only executed on topologies with LINEAR ordering, because the ordering was assumed to make no difference for them. Since the architecture of the benchmark only allows the levels of one parameter to be compared to each other, with all other parameters being equal, comparing neighborhood collectives with RAND orderings to global collectives was not possible in this campaign. However, the fact that the guideline \(\texttt {LINEAR} \preceq \texttt {RAND}\) was violated so often, together with the observation that Alltoall and Neigh_alltoall perform similarly with LINEAR ordering for small message sizes, led to the assumption that \(\texttt {Alltoall} \preceq \texttt {Neigh\_alltoall}\) could be violated on full RAND topologies. Therefore, campaign full-rand-jupiter was set up and executed, and indeed showed the expected violations for small message sizes (cf. Fig. 1).

Fig. 2. Results comparing Alltoall to Neigh_alltoall, campaign full-tuned-jupiter, \(35\times 16\) processes, Full neighborhood with RAND ordering and \(\texttt {reorder}=0\), outliers removed.

A closer look into Open MPI revealed that the algorithm for Alltoall is switched, by default, at a message size of 3000 B. The so-called MCA parameters allow such thresholds to be changed at runtime. In the campaign full-tuned-jupiter, the violations could be eliminated by setting the threshold message size to 256 B (cf. Fig. 2). In the case of LINEAR ordering, Alltoall was now considerably faster than Neigh_alltoall as well.

In the three campaigns nbhcoll-jupiter, nbhcoll-vsc3 and full-jupiter, the \(\texttt {reorder=1} \preceq \texttt {reorder=0}\) guideline was never violated, and the reordering flag did not seem to have an effect on violations of the other guidelines (cf. Tables 3 and 4). Subsequent experiments, which just created the topologies used in the campaigns with reordering enabled and checked for a change in the process-to-rank mapping of the new communicator, confirmed this conjecture. Campaigns full-rand-jupiter and full-tuned-jupiter were therefore set up with reordering disabled in general.

Table 3. Number of guideline violations in experiments with different neighborhoods of radius 1 on Jupiter and VSC3. Format: \(n_\text {vioTests}/n_\text {tests}\) \((n_\text {vioGroups}/n_\text {groups})\).
Table 4. Number of guideline violations in campaign full-jupiter. Format: \(n_\text {vioTests}/n_\text {tests}\) \((n_\text {vioGroups}/n_\text {groups})\).

6 Conclusion and Outlook

Performance guidelines help to express expectations for neighborhood collectives in a formal way and enable computers to check them automatically on a large number of measurements. The results show that current MPI implementations probably have room to improve their performance, although admittedly, not every violation can easily be attributed to the MPI implementation in the complex environment of a cluster computer without further investigation. Still, especially the cases where simulating a Cartesian topology with an equivalent DISTGRAPH topology increases performance are surprising, since algorithms could benefit from the fixed structure of Cartesian topologies. This, together with the violations of GL11, suggests that the examined MPI implementations are sensitive to the ordering of neighbors.

In the future it would be interesting to execute similar campaigns on further cluster computers, especially ones with a network topology resembling a Cartesian grid. Measuring with bigger neighborhoods could be interesting as well, although the question arises whether there are real-world problems that would be affected by the results. Of course, the remaining guidelines formulated but not evaluated in this article should be tested – especially those dealing with different methods of communicator creation. Since some MPI implementations nowadays promise true asynchronous progress, guidelines for nonblocking neighborhood collectives should be formulated and assessed as well.