1 Introduction

A parallel reduction can be implemented on SMP computers [1] using OpenMP [4, 5] in many ways. In addition to the built-in support for reduction operations, a programmer can implement it by using the standard work-sharing constructs provided by OpenMP or by distributing the work among the available CPUs manually.

The article shows that the built-in specialized reduction clause [5] is easy to use but is neither time- nor cost-optimal, especially for small sets of operands reduced by an operator with high time complexity. It also shows how to implement a time- and cost-optimal parallel reduction algorithm suitable for any problem size.

It is assumed that the reader has knowledge of the ANSI C/C++ programming language and at least a basic notion of the OpenMP library.

2 Parallel Reduction on PRAM

Consider the EREW PRAM [6, 7] parallel reduction algorithm with \( p \) processors \( P_{0} ,P_{1} , \ldots ,P_{p - 1} \) and with \( n \) shared-memory cells \( M[0],M[1], \ldots ,M[n - 1] \) described in Algorithm 1. It is obvious that \( p = n \). In contrast to the sequential version of the reduction algorithm with time complexity \( \varTheta (n) \), the optimal parallelized algorithm can compute the result with time complexity

$$ T_{(n,p)} = O({ \log }_{2} (p)) $$
(1)

As can be seen from Algorithm 1, the computation consists of \( \left\lceil {{ \log }_{2} n} \right\rceil \) sequential phases \( j = 1, \ldots ,\left\lceil {{ \log }_{2} n} \right\rceil \), where only \( \frac{n}{{2^{j} }} \) processors do useful work in each phase, as shown in Fig. 1.
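
For illustration (the concrete value \( n = 8 \) is chosen here only as an example and does not appear in the original text), with \( n = p = 8 \) the computation finishes in three phases:

$$ T_{(8,8)} = \left\lceil { \log }_{2} 8 \right\rceil = 3, \qquad \frac{8}{2^{1}} = 4,\; \frac{8}{2^{2}} = 2,\; \frac{8}{2^{3}} = 1 \; \text{processors active in phases } j = 1, 2, 3. $$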

Fig. 1 EREW PRAM parallel reduction

Now let us evaluate the cost [7] of the parallel reduction algorithm. Generally, the cost of a parallel algorithm solving a problem of size \( n \) on \( p \) processors denoted by \( C_{(n,p)} \) is defined as

$$ C_{(n,p)} = p \times T_{(n,p)} $$
(2)

Assume that a sequential upper bound [7], denoted by \( SU(n) \), on the number of sequential steps needed to solve a problem is known. A parallel algorithm solving the same problem is said to be cost-optimal if

$$ C_{(n,p)} = O(SU(n)) $$
(3)

The upper bound of the sequential reduction algorithm is \( SU(n) = \varTheta (n) = O(n) \), so the cost of the parallel reduction algorithm defined in Algorithm 1 is \( C_{(n,p)} = p \times O({ \log }_{2} p) \). The non-scaled parallel reduction algorithm uses \( p = n \) processors, which implies \( C_{(n,p)} = \varTheta (n{ \log }_{2} n) \), i.e. asymptotically more than \( SU(n) \); hence, the algorithm is not cost-optimal since the condition defined in (3) is not met. The root of the problem is that the implementation does not distribute enough useful work across the parallel system: in later phases most of the processors are idle.
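
To make the gap concrete (the numbers below are illustrative only and not taken from the original text), consider \( n = p = 1024 \):

$$ C_{(1024,1024)} = 1024 \times { \log }_{2} 1024 = 10240 \gg SU(1024) = 1024. $$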

One possible way to improve the cost of the parallel algorithm is to choose a better granularity, i.e. to change the ratio between the size of the solved problem \( n \) and the number of processors \( p \) used for the calculation, a technique known as scaling of the algorithm [7].

Typically, decreasing the number of processors leads to a better cost and efficiency of a parallel algorithm, as stated in [1]. Therefore, we should modify Algorithm 1 so that the amount of work per processor increases. A possible modification is shown in Algorithm 2.

Assume \( p^{\prime} < p \) processors and a problem of size \( n > p^{\prime} \).

The parallel reduction described in Algorithm 2 and illustrated in Fig. 2 is calculated as follows: the input sequence of reduced items is split into \( p^{\prime} \) sets of size \( \frac{n}{{p^{\prime}}} \) and each processor calculates its partial reduction sequentially. This stage produces \( p^{\prime} \) partial results which are then combined into the final result by using Algorithm 1. The time complexity of Algorithm 2 is

$$ T_{(n,p)} = O(\frac{n}{{p^{\prime}}} + { \log }_{2} (p^{\prime})) $$
(4)

Fig. 2 Efficient scaled parallel reduction

Cost of the scaled parallel reduction algorithm is then

$$ C_{(n,p)} = p^{\prime}(\frac{n}{{p^{\prime}}} + { \log }_{2} (p^{\prime})) = n + p^{\prime}{ \log }_{2} (p^{\prime}) $$
(5)

When \( n \gg p^{\prime} \), Eq. (5) can be rewritten as

$$ C_{(n,p)} = O(n) = O(SU(n)) $$
(6)

so the algorithm can be regarded as both time- and cost-optimal.

Scaled parallel reduction of 32 items by using 4 processors is illustrated in Fig. 2.
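
Plugging these values into Eqs. (4) and (5), and counting one reduction operation per step (this worked arithmetic is added here only as an illustration), gives roughly

$$ T_{(32,4)} \approx \frac{32}{4} + { \log }_{2} 4 = 10, \qquad C_{(32,4)} \approx 4 \times 10 = 40 = O(n), $$

compared to the 31 operations of the purely sequential reduction.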

3 Built-in Parallel Reduction Support in OpenMP

The OpenMP Application Program Interface (API) offers a cross-platform shared-memory parallel programming framework for C/C++ and Fortran on all architectures, including Unix and Windows NT platforms. Jointly defined by a group of major computer hardware and software vendors, OpenMP is a portable, scalable model that gives shared-memory parallel programmers a simple and flexible interface for developing parallel applications for platforms ranging from the desktop to the supercomputer [4].

OpenMP defines a set of compiler preprocessor directives for parallel region definition, work-sharing constructs and thread synchronization. The behavior of the directives can be further tuned by so-called clauses. In addition to the generic directives and their clauses, OpenMP also contains built-in support for parallel reduction provided by the reduction clause of the for directive. Let us examine how a parallel reduction can be implemented by using these OpenMP constructs.

Listing 1 shows the native OpenMP parallel reduction implementation, which uses the well-known reduction clause to calculate a parallelized summation on a shared variable without the need for explicit inter-thread synchronization. In addition, this clause also avoids a possible data race on the shared variable.

Listing 1: Built-in OpenMP parallel reduction
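
The source code of Listing 1 is not reproduced in this text; a minimal sketch of such a reduction-clause based summation (the variable and array names are illustrative only, not the authors' original code) could look as follows:

double builtin_reduce(const double *M, long n)
{
    double sum = 0.0;
    /* the reduction clause gives every thread a private copy of sum and
       combines the copies automatically at the end of the loop */
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < n; i++)
        sum += M[i];
    return sum;
}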

Most OpenMP programmers will probably use this approach, but is it really the best possible solution? As stated in the GNU OpenMP implementation notes [3], the reduction clause is implemented using an array of the type of the reduction variable, indexed by the thread’s team id. Each thread stores its final value into the array, and after the barrier the master thread iterates over the array to collect the values. To better understand what happens inside the native OpenMP implementation, the code presented in Listing 1 can be rewritten by using atomic operations and private variables, as shown in Listing 2.

Listing 2: Built-in OpenMP parallel reduction deconvolution
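
Again, the original Listing 2 is not reproduced here; based on the description above, an equivalent hand-written form using private partial sums and an atomic combination step (names are illustrative) might look like this:

double builtin_reduce_expanded(const double *M, long n)
{
    double sum = 0.0;
    #pragma omp parallel
    {
        double priv = 0.0;          /* per-thread private partial sum */
        #pragma omp for
        for (long i = 0; i < n; i++)
            priv += M[i];
        /* the partial sums are merged one by one, effectively serially */
        #pragma omp atomic
        sum += priv;
    }
    return sum;
}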

Now it is obvious that the native implementation, which runs in \( T(n,p) = O(\frac{n}{p} + p) \), has a higher time complexity than the efficient implementation described in Algorithm 2 with the time complexity defined in (4); the extra \( O(p) \) term comes from the sequential combination of the \( p \) partial results. Figure 3 illustrates how the reduced operands are handled in this implementation.

Fig. 3 Inefficient scaled parallel reduction

Moreover, the reduction clause allows users to use only a limited set of predefined reduction operators (such as +, −, * and the bitwise and logical operators), so if some custom reduction operator is needed, the user must implement the algorithm himself.

The following chapters focus on various possible implementations of an optimal parallel reduction algorithm in OpenMP without usage of the reduction clause.

4 Custom Implementations

Various custom implementations of the two main types of reduction algorithms are discussed in the following chapters: the trivial and the scaled parallel reduction algorithm. For simplicity, let us assume the number of reduced operands n is always a power of two.

4.1 Trivial Parallel Reduction Algorithm

Let n denote the number of reduced operands and p the number of available CPUs performing the calculation, where \( n = p \). Also assume \( M[] \) to be an array of size n containing all reduced operands. Then the trivial parallel reduction Algorithm 1 can be implemented as shown in Listing 3.

Listing 3: Custom trivial parallel reduction
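
The original Listing 3 is not included in this text; a sketch matching the description below (in-place pairwise summation over the shared array M, with a new work-sharing loop in every phase) could be:

void trivial_reduce(double *M, long n)      /* n assumed to be a power of two */
{
    for (long stride = 1; stride < n; stride *= 2) {
        /* one parallel work-sharing loop per sequential phase */
        #pragma omp parallel for
        for (long i = 0; i < n; i += 2 * stride)
            M[i] += M[i + stride];          /* combine adjacent partial sums */
    }
    /* the final result is left in M[0] */
}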

The algorithm consists of \( { \log }_{2} (n) \) sequential phases where each phase calculates the summation of adjacent operands/partial sums as described in Fig. 1. Notice that the OpenMP parallel loop directive is used there for the distribution of the work among the available CPUs. The work-sharing is performed in each sequential step, which leads to a significant parallel overhead that cannot be neglected. Due to this hidden time constant, the overall performance of this algorithm is comparable to the one using the built-in reduction clause even though its theoretical time complexity is better. The comparison can be seen from the benchmark results discussed in chapter 5.

The parallel overhead revealed in the previous paragraph can be reduced by omitting the OpenMP parallel loop directive in favour of work-sharing implemented with standard C/C++ constructs. Let us discuss the modification of the previous algorithm shown in Listing 4.

Listing 4: Custom optimized trivial parallel reduction
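
The code of Listing 4 is likewise not reproduced here; following the description in the next paragraph (a single parallel region, modulo-based selection of active threads and an explicit barrier between phases), it may be sketched as follows (names are illustrative):

#include <omp.h>

void trivial_reduce_opt(double *M, long n)  /* n assumed equal to the team size */
{
    #pragma omp parallel
    {
        long tid = omp_get_thread_num();
        for (long stride = 1; stride < n; stride *= 2) {
            /* only threads whose index is a multiple of 2*stride do useful work */
            if (tid % (2 * stride) == 0 && tid + stride < n)
                M[tid] += M[tid + stride];
            #pragma omp barrier             /* separate the sequential phases */
        }
    }
    /* the final result is left in M[0] */
}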

In this implementation, all available CPUs remain active in all sequential steps, but only a subset of them does useful work. The indexes of the enabled CPUs are calculated at run time by using the modulo operator, in contrast to the static work-sharing used in Listing 3. Notice that each thread in the parallel team evaluates the condition and performs the summation independently of the other active threads, so the global barrier placed below the conditional statement is needed for synchronization of the subsequent parallel phases.

The benefits of this modification are clearly noticeable from the benchmarks shown in chapter 5.

4.2 Scaled Parallel Reduction Algorithm

Both algorithms listed in chapter 4.1 are implementations of the inefficient trivial parallel reduction discussed in chapter 2, assuming that the size of the solved problem is equal to the number of assigned CPUs. Now, let us focus on a time- and cost-efficient scaled parallel reduction implementation which allows users to calculate the summation of \( n \gg p \) operands with nearly linear speed-up.

The trivial scaled parallel reduction discussed in this chapter enhances the algorithms of chapter 4.1 so that they can be used for the summation of a huge number of operands with limited resources, i.e. the available CPUs.

The algorithm consists of two main phases:

The first phase divides the set of reduced operands into \( p \) subsets consisting of \( \frac{n}{p} \) items. Each subset is reduced on a single CPU in \( T(n,1) = O(\frac{n}{p}) \) and all subsets are processed on \( p \) CPUs in parallel, so the overall time complexity of this phase is \( T(n,p) = O(\frac{n}{p}) \).

The second phase calculates the final result from the partial results provided by the first phase in \( T(n,p) = O({ \log }(p)) \) by using the optimized parallel algorithm discussed in chapter 4.1 (Listing 4). Notice that private variables subsum are used by all threads in the parallel team for the calculation of the partial results instead of direct write access to the shared array \( S[] \) used in the second phase. This modification avoids the parallel overhead needed for inter-thread synchronization during concurrent write access to shared resources. Moreover, the nowait clause of the parallel for directive omits the unnecessary implicit barrier at the end of the parallel loop.

A global explicit barrier placed between the phases ensures that the content of the shared array \( S[] \), used as an input of the second phase, is up-to-date after the first phase finishes.

The overall time complexity of this implementation, shown in Listing 5, is \( T(n,p) = O(\frac{n}{p} + { \log }(p)) \). As can be seen from the results published in chapter 5, this implementation guarantees the best performance of all the presented algorithms.

Listing 5: Custom scaled and optimized trivial parallel reduction
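
Since the code of Listing 5 is not reproduced in this text, the following sketch only mirrors the two-phase description given above; the names S, subsum and the helper function are illustrative, not the authors' original identifiers:

#include <omp.h>
#include <stdlib.h>

double scaled_reduce(const double *M, long n)
{
    int p = omp_get_max_threads();
    double *S = (double *)calloc((size_t)p, sizeof *S); /* shared partial results */
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        double subsum = 0.0;                     /* private partial sum */

        /* phase 1: each thread reduces roughly n/p operands sequentially */
        #pragma omp for nowait
        for (long i = 0; i < n; i++)
            subsum += M[i];
        S[tid] = subsum;

        #pragma omp barrier                      /* S[] must be complete before phase 2 */

        /* phase 2: tree reduction of the p partial results (cf. Listing 4) */
        for (int stride = 1; stride < p; stride *= 2) {
            if (tid % (2 * stride) == 0 && tid + stride < p)
                S[tid] += S[tid + stride];
            #pragma omp barrier
        }
    }
    double result = S[0];
    free(S);
    return result;
}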

5 Performance Benchmarks

All benchmarks published in this chapter were measured on a SuperMicro SMP server A+ 1042G-TF with the following hardware configuration (Table 1):

Table 1 Hardware configuration for benchmarking

The testing machine ran Ubuntu Linux 12.04 LTS installed inside a virtual machine managed by the KVM hypervisor, with 16 physical cores assigned and 16 GB of RAM allocated.

5.1 Parallel Reductions Benchmark

A set of performance benchmarks was performed to evaluate the performance metrics of the discussed scaled parallel reduction algorithms and their implementations. The testing application ran on the virtual machine specified above. All available physical cores were assigned to the OpenMP parallel team. Note that all standard compiler optimizations were disabled by using the −O0 GCC compiler switch during the build, in order to prevent unwanted source-code optimizations from influencing the measurement.
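
The exact build invocation is not stated in the paper; with GCC it would typically take roughly the following form (the file name is illustrative only):

gcc -O0 -fopenmp reduction_benchmark.c -o reduction_benchmark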

The results obtained from the tests are shown in Table 2. The parallel (calculation) time as well as the parallel speed-up and efficiency [1] of the discussed parallel algorithms can be found there. The column named “Built-in red.” shows the performance of the scaled parallel reduction implemented by using the built-in reduction clause described in Listing 1. The column named “Custom red.” shows the performance of the scaled algorithm described in Listing 3, while the last column named “Optimized custom red.” shows the performance of the algorithm described in Listing 5.

Table 2 Performance metrics for discussed reduction algorithms

As can be seen from the table, the best results in terms of the highest speed-up and efficiency and the lowest parallel time are provided by the optimized custom algorithm shown in Listing 5. The speed-up and efficiency degradation visible in the results stems from the non-negligible OpenMP implementation overhead. For better clarity, the parallel times shown in the table are compared in Fig. 4.

Fig. 4 Parallel times of various reduction algorithms

6 Conclusion

The paper shows that the built-in OpenMP support for parallel reduction provided by the reduction clause is neither theoretically nor practically optimal, at least for some kinds of applications and calculations. It shows that the performance of parallel reduction algorithms can be improved significantly by using custom optimized implementations. However, it is important to note that the measured and discussed results can differ for other specific implementations and usage scenarios, mainly if other compiler-level code optimizations and reduction operators with different time complexity are used.

The paper also shows that the scalability of even trivial OpenMP algorithms is sufficient for various parallel reduction implementations.