1 Introduction and Background

Modern processor architectures are being designed with increasing amounts of on-chip parallelism, delivering higher computing power by exploiting multiple levels of parallelism. In high-performance scientific computing, these architectures play a central role in providing the much needed compute and memory resources. In this paper, we consider the performance analysis and optimization of a recently developed application code, HipGISAXS [2, 3], a high-performance code for X-ray scattering data analysis [4, 5]. Such analysis enables scientists to characterize macromolecules and nanoparticle systems based on their structural properties, such as shape and size, at the micro/nano-scales. Major applications include the characterization of materials for the design and fabrication of energy-relevant nano-devices, such as photovoltaic cells, and the development of high-density storage media.

In this paper, we consider some of the most compute-intensive kernels of the HipGISAXS code to demonstrate and analyze high performance through the exploitation of simultaneous multi-threading (SMT) and single-instruction multiple-data (SIMD) parallelism on modern multi-core processors. We discuss various methods to effectively exploit the available on-node parallelism to increase parallel efficiency, and provide detailed performance analysis on leading Cray supercomputers. We further provide performance analysis on Intel’s Knights Landing manycore processor, which has an even larger number of compute cores along with 512-bit wide vector units.

2 Computational Kernels

In the HipGISAXS simulation code, the kernels compute on a two-dimensional grid with non-uniform positioning of the grid points. The simulated scattering light intensities are computed at each grid point. The computations at each grid point are independent of one another, making them amenable to efficient exploitation of parallelism. In each simulation, two types of routines are executed: Form Factor (FF) computations and Structure Factor (SF) computations. The former represent the scattering due to individual nanoparticles in the sample model, while the latter represent the scattering due to all particles forming a structure (such as their three-dimensional arrangement) as a whole. Both types contain a number of kernels, each depending on the type of computations performed. The FF kernels can be either analytical, where the computations are derived analytically for simple nanoparticle shapes such as spheres, cylinders, and cubes, or numerical, which are generic and used for complex structures that cannot be efficiently described through simple shapes. In the following we focus on one analytical kernel and one numerical kernel to demonstrate the exploitation of thread and vector parallelism. The form factor computation at a single q-point involves accessing data and performing independent calculations for each of the input shape triangles, followed by a reduction over all the triangles. This is done for all the q-points in the problem under consideration. The output matrix \(\mathcal {F}\), of the same size as the Q-grid \(\mathcal {Q}\), is constructed from the results of these computations, as summarized in the following equation:

$$\begin{aligned} \mathcal {F}: f(\varvec{q}) = -\frac{i}{|\varvec{q}|^2}\sum _{t=1}^{n_t}e^{i\varvec{q}\cdot \varvec{r_t}}q_{n,t}s_t, \quad \forall \varvec{q}\in \mathcal {Q}. \end{aligned}$$
(1)

2.1 An Analytical Form Factor Kernel

For certain simple nanoparticle shapes, such as spherical, cylindrical, cuboidal, etc., analytical formulae can be derived to calculate the form factor. In most cases, such analytical computations are significantly cheaper than performing the general-case computations through volume integrals. Hence, if a sample model can be described in terms of simple shapes, it is preferable to use the analytical method. The application under consideration includes a number of such analytical form factor computational kernels. We select one of these analytical kernels, the form factor for a cylindrical shape, for our study in the following sections.

The basic computational loop structure of the cylinder form factor kernel is shown in Algorithm 1. The outermost loop is over all the q-points (\(n_{qz}\)). The number of q-points in a typical simulation is over 1 million. All the kernels in this application make heavy use of various transcendental functions, such as sine, cosine, exponential, and Bessel functions. Additionally, since the q-points lie in complex space, the computations are performed on complex numbers. The inner loop-nest in Algorithm 1 is over the two parameters defining the cylindrical shape, the radius (\(\mathtt {rsize}\)) and the height (\(\mathtt {hsize}\)). The total number of iterations of this loop-nest is typically small, on the order of tens, and in many cases just one.

Algorithm 1. Basic computational loop structure of the analytical cylinder form factor kernel.
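As a rough illustration of this loop structure, the following sketch mirrors Algorithm 1; the names (such as \(\mathtt {ff\_cylinder}\)) and the per-term expression are illustrative stand-ins, not the actual HipGISAXS source.

```cpp
// Hypothetical sketch of Algorithm 1; names and the per-term expression are
// illustrative stand-ins, not the actual HipGISAXS source.
#include <complex>
#include <vector>

using cplx = std::complex<double>;

// Stand-in for the transcendental-heavy cylinder term (not the real formula).
static cplx cylinder_term(const cplx& qpar, const cplx& qz, double r, double h) {
  return r * h * std::exp(qz * cplx(0.0, 0.5 * h)) * std::sin(qpar * r);
}

void ff_cylinder(const std::vector<cplx>& qpar, const std::vector<cplx>& qz,
                 const std::vector<double>& rsize, const std::vector<double>& hsize,
                 std::vector<cplx>& ff) {
  for (size_t i = 0; i < qz.size(); ++i) {      // outer loop: all n_qz q-points
    cplx sum(0.0, 0.0);
    for (double r : rsize)                      // small inner loop-nest over the
      for (double h : hsize)                    // cylinder radius and height
        sum += cylinder_term(qpar[i], qz[i], r, h);
    ff[i] = sum;                                // one form factor value per q-point
  }
}
```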

2.2 A Numerical Form Factor Kernel

For sample models in which the nanoparticle shapes cannot be defined in terms of simple shapes, a full numerical integration over the particles’ structure needs to be performed. This represents the general case, which can handle any kind of complex shape. A shape is defined through a discretization of its surface, such as a triangulation. The higher the discretization resolution, the better the computed approximation, but at a higher computational cost. The shape surface integration adds another dimension to the computational problem. A typical structure of a numerical form factor kernel is shown in Algorithm 2. As with the analytical kernels described previously, the outermost loop is over the q-points. The primary internal loop is over the set of triangles describing the corresponding structure in the sample. The number of triangles can vary from several hundred to several million depending on the structure complexity and discretization resolution.

Algorithm 2. Typical loop structure of a numerical form factor kernel.
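A hedged sketch of this structure, following Eq. (1), is given below. The \(\mathtt {Triangle}\) fields and the interpretation of \(q_{n,t}\) (projection of \(\varvec{q}\) on the triangle normal) and \(s_t\) (triangle area) are assumptions made for illustration.

```cpp
// Hedged sketch of Algorithm 2 following Eq. (1); field names and the
// interpretation of q_{n,t} and s_t are assumptions for illustration.
#include <complex>
#include <vector>

using cplx = std::complex<double>;

struct Triangle {
  double cx, cy, cz;   // centroid r_t
  double nx, ny, nz;   // unit normal
  double area;         // s_t
};

void ff_numerical(const std::vector<double>& qx, const std::vector<double>& qy,
                  const std::vector<double>& qz, const std::vector<Triangle>& tris,
                  std::vector<cplx>& ff) {
  const cplx I(0.0, 1.0);
  for (size_t i = 0; i < qx.size(); ++i) {                  // outer loop over q-points
    double q2 = qx[i]*qx[i] + qy[i]*qy[i] + qz[i]*qz[i];    // |q|^2
    cplx sum(0.0, 0.0);
    for (const Triangle& t : tris) {                        // loop over all triangles
      double q_dot_r = qx[i]*t.cx + qy[i]*t.cy + qz[i]*t.cz;
      double q_n     = qx[i]*t.nx + qy[i]*t.ny + qz[i]*t.nz;
      sum += std::exp(I * q_dot_r) * q_n * t.area;          // per-triangle term, then reduce
    }
    ff[i] = -I / q2 * sum;                                  // prefactor -i/|q|^2 from Eq. (1)
  }
}
```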

3 Computational Platforms and Performance Modeling

In the work discussed in this paper, we perform experiments and performance analysis on the current Cray supercomputers installed at NERSC at Berkeley Lab:

  1. Cray XC30 (Edison): dual-socket compute nodes with 12-core Intel Ivybridge processors, for a total of 24 cores and 64 GB of memory per node.

  2. Cray XC40 (Cori Phase 1): dual-socket compute nodes with 16-core Intel Haswell processors, for a total of 32 cores and 128 GB of memory per node.

Additionally, we analyze the performance on Intel’s new Knights Landing (KNL) processor:

  3. A single-socket compute node with a 1.3 GHz 64-core Intel Knights Landing processor, 96 GB of DRAM as well as 16 GB of MCDRAM (high-bandwidth memory).

In order to understand the performance of a code and identify bottlenecks before any optimization is performed, it is helpful to establish a performance model against which the code performance can be measured. In this paper, we present the analysis using the Roofline performance model [6]. The Roofline toolkit was used to obtain the empirical performance bounds for computation (GFLOPs/s) and memory bandwidth (GB/s) with respect to the first-level cache and DRAM. These are discussed in Sect. 6.
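For reference, the Roofline model bounds the attainable performance \(P\) of a kernel with arithmetic intensity \(AI\) (FLOPs executed per byte moved) by the peak compute rate and the product of \(AI\) with the relevant memory bandwidth:

$$\begin{aligned} P \le \min \left( P_{\mathrm {peak}},\; AI \times BW \right) . \end{aligned}$$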

4 Threading

An effective way to exploit the multiple compute cores available on a processor is through shared-memory threading. In our case, we use OpenMP to implement threading in each of the kernels. Since the scattering light intensity computations at each q-point are independent of each other, straightforward threading of the loop over the q-points is the most efficient approach. The OpenMP threads do not require any significant synchronization, allowing the kernels to make full use of the available cores asynchronously. To achieve higher threading performance, care should be taken when constructing data structures and buffers so as to minimize data movement overheads, cache line invalidation overheads, and page thrashing. A node on a typical supercomputer may have multiple NUMA regions, and it is the programmer’s responsibility to minimize data traffic across these regions. The platforms we use for the experiments in this paper have dual-socket nodes (see Sect. 3), giving effectively two primary NUMA regions. Furthermore, to obtain the best thread affinities, we use the \(\mathtt {KMP\_AFFINITY}\) environment variable to set core-level granularity and compact bindings with a permute value of 1 (\(\mathtt {KMP\_AFFINITY=granularity=core,compact,1}\)). Thread strong-scaling performance is shown in Fig. 1 for the two systems under consideration.
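A minimal sketch of this threading approach is shown below; \(\mathtt {compute\_at\_qpoint}\) is a hypothetical stand-in for the actual per-q-point kernel body, and thread placement is then controlled through the \(\mathtt {KMP\_AFFINITY}\) setting given above.

```cpp
// Minimal OpenMP sketch: q-points are independent, so the loop over them is
// parallelized with no synchronization between threads.
#include <omp.h>
#include <complex>
#include <vector>

using cplx = std::complex<double>;

// Hypothetical stand-in for the real per-q-point computation.
static cplx compute_at_qpoint(size_t i) { return cplx(double(i), 0.0); }

void compute_all_qpoints(std::vector<cplx>& ff) {
  #pragma omp parallel for schedule(static)
  for (size_t i = 0; i < ff.size(); ++i)
    ff[i] = compute_at_qpoint(i);   // independent work; each thread writes disjoint locations
}
```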

Fig. 1. Speedups for the two form factor kernels under consideration with increasing number of threads, on the Edison (left) and Cori-1 (right) systems, with a Q-grid of 125,000 q-points.

5 Vectorization

Data parallelism in modern processors is provided through wide vector units, with widths such as 128, 256, or 512 bits. Typical multi-core processors, such as the Intel Ivybridge and Intel Haswell, provide 256-bit wide vector registers through the AVX and AVX2 instruction sets, respectively. Our application uses double-precision computations, allowing these vector units to be used with 4-way SIMD parallelism.

5.1 Compiler-Based Auto-Vectorization

Modern compilers are quite sophisticated and are able to automatically vectorize many codes. This alleviates the need for implementations that use vector datatypes directly in order to take advantage of the vector units. Typically, compilers are able to auto-vectorize inner-most loops if they satisfy certain conditions, such as: (1) the loop counts are known before execution, (2) the loops are single-block and free of branching, particularly branches that may cut the loop short, (3) there are no backward loop-carried dependencies, and (4) the computations involve simple operations for which vector instructions or vectorized libraries are available. We use Intel compilers version 16 to compile our code for all the experiments presented in this paper, and we applied various compiler-directed pragmas, including Intel’s \(\mathtt {ivdep}\) and OpenMP \(\mathtt {simd}\). Unfortunately, in our case, attempts to auto-vectorize the kernel codes failed. As described in Sect. 2, the loop structures of the kernels in our application violate most of the requirements for auto-vectorization, since they: (1) make heavy use of transcendental functions and subroutines with significant amounts of branching, (2) have inner-most loops with either small iteration counts, as in the analytical kernel presented previously, or complex structures, as in the numerical kernels, and (3) perform computations on complex-number datatypes, which, although they can generally be auto-vectorized, may be handled inefficiently for non-basic operations.
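For context, such hints are typically placed directly above an inner loop, as in the generic illustration below; this is not our kernel code, which failed to auto-vectorize even with these hints.

```cpp
// Generic illustration of vectorization hints; the HipGISAXS kernels do not
// satisfy the conditions above, so such hints were not sufficient for them.
void scaled_add(double* a, const double* b, double s, int n) {
  // Intel-specific alternative: #pragma ivdep (assume no loop-carried dependencies)
  #pragma omp simd
  for (int i = 0; i < n; ++i)
    a[i] += s * b[i];   // simple, branch-free body with a known trip count
}
```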

5.2 Intel Math Kernel Library Vector Functions

Since the compilers are unable to auto-vectorize our codes, we utilized existing vectorized libraries, namely the Intel Math Kernel Library’s (MKL) Vector Math Library (VML) and CBLAS (level 1) vector functions, to implement the kernels. CBLAS is the C-style interface to the BLAS library. Although the use of these libraries enabled vectorization of a large fraction of the kernels, it presented a tradeoff between floating-point accuracy and performance. The default mode in VML is High-Accuracy (HA). Use of this mode resulted in a performance slowdown of our kernel codes on the Intel Ivybridge processors. The two other accuracy modes available in VML are Low-Accuracy (LA) and Enhanced-Performance (EP). Using LA with our kernel code, we observed a slowdown as well. On the other hand, with the EP mode the code showed a performance speedup of up to 1.3\(\times \). Unfortunately, the precision provided by the EP mode is too low for our application to perform any useful simulations. These results on the use of Intel MKL are summarized in Table 1.
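As an illustration of this accuracy/performance trade-off, a VML call can be issued under an explicit accuracy mode as sketched below. This is a minimal example using standard VML entry points; the actual kernels chain many such vectorized transcendental calls over complex buffers.

```cpp
// Minimal MKL VML sketch: select the EP accuracy mode, then apply a vectorized
// complex exponential to a whole buffer of arguments.
#include <mkl.h>
#include <vector>

void vml_exp_ep(const std::vector<MKL_Complex16>& in, std::vector<MKL_Complex16>& out) {
  out.resize(in.size());
  vmlSetMode(VML_EP);                                   // or VML_HA / VML_LA
  vzExp((MKL_INT)in.size(), in.data(), out.data());     // element-wise complex exp
}
```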

Table 1. Performance comparison on Cori-1 between the different accuracy modes of Intel MKL VML applied to the analytical cylinder form factor kernel.
Fig. 2. (Left) AVX vectorized representation of a complex number using two double-precision AVX vector datatypes (\(\mathtt {\_\_m256d}\)). (Right) An example of a complex number multiplication using the available AVX intrinsics for real numbers.

5.3 Hand-Vectorization

The highest performance efficiency can be achieved when the implementation is done directly with vector instructions. Intel provides a set of vector intrinsics, which are low-level C functions acting as wrappers around the assembly instructions, allowing a programmer to implement efficiently without writing assembly language directly. Given the low performance boost achieved with the Intel MKL VML as presented in the previous section, hand-vectorization is the obvious next step to enable efficient vectorization. Additionally, since our kernels perform computations on complex numbers, hand-vectorization gives more flexibility in data encoding and storage. In our implementation we follow the AoSoA (Array of Structures of Arrays) model. The real and imaginary components of a complex number are encoded with vector datatypes into a structure, and arrays of this ‘complex vector’ structure are used to implement the kernels. This structure can be written for AVX as shown in Fig. 2, where \(\mathtt {\_\_m256d}\) is the AVX datatype packing four double-precision values (256 bits). Hence, one complex AVX vector holds four double-precision complex numbers. This allows full use of the available SIMD capabilities, as opposed to the alternative approach where full complex numbers are encoded within the same vector with alternating real and imaginary components; that alternative can hold only two double-precision complex numbers in a single AVX vector and would involve some redundancy when performing operations on them. We implement all compute operations on vector complex numbers using the first approach and real-number vector intrinsics. One of the simplest operations, a complex-complex multiplication, can be written using the real multiply and add/subtract intrinsics, also shown in Fig. 2, where the \(\mathtt {\_mm256\_*\_pd()}\) intrinsics represent the AVX vector operations on the packed double-precision AVX datatype.
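A hedged reconstruction of this ‘complex vector’ structure and of the complex-complex multiplication corresponding to Fig. 2 is given below; the names used here are illustrative.

```cpp
#include <immintrin.h>

// Four double-precision complex numbers, split into real and imaginary parts.
struct avx_cplx {
  __m256d real;
  __m256d imag;
};

// (a + ib)(c + id) = (ac - bd) + i(ad + bc), computed for four pairs at once
// using only real-number AVX multiply and add/subtract intrinsics.
static inline avx_cplx avx_cplx_mul(const avx_cplx& x, const avx_cplx& y) {
  avx_cplx r;
  r.real = _mm256_sub_pd(_mm256_mul_pd(x.real, y.real),
                         _mm256_mul_pd(x.imag, y.imag));
  r.imag = _mm256_add_pd(_mm256_mul_pd(x.real, y.imag),
                         _mm256_mul_pd(x.imag, y.real));
  return r;
}
```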

5.4 Vectorizing Analytical Cylinder Form Factor Kernel

We implement the entire analytical cylinder form factor kernel using AVX vector intrinsics. The vectorization is done at the outer loop level of the code structure shown in Algorithm 1. A major performance bottleneck in this kernel is the first-order Bessel function of the first kind, commonly denoted \(J_1\). Because of two factors, the need for a complex \(J_1\) function and the absence of any AVX vectorized implementation of \(J_1\), we implemented this function following the sequential real-number implementation described in [1]. This function involves significant branching and loop breaks based on the argument values, making it harder to vectorize efficiently. For example, for branching based on the argument value, some elements of a vector might satisfy one condition while others satisfy another; masking is used to handle such cases. Another example is the redundancy introduced by vectorizing the conditional loop breaks: for a given vector, the loop is executed for the largest number of iterations required among the vector entries, while the redundant computations are masked off.

Fig. 3. Execution times (left) of the analytical cylinder form factor kernel and corresponding speedups (right) with the various vectorized versions on Cori-1: base (non-vectorized), Intel MKL with VML modes HA, LA and EP, and AVX intrinsics. Performance is shown for this kernel with a Q-grid of 500,000 q-points.

The basic structure of the scalar \(J_1\) function is shown in Algorithm 3. This function takes the form of a transcendental function with infinite series summations. The listing highlights the basic branching and conditional loop breaks. In a vectorized version, the input argument is a set of complex numbers, and the computations depend on the actual values of each of these numbers. Both the asymptotic and direct methods are executed in a call to this function, and a masking vector is used to ensure that only the correct arguments are used in each method. For example, given an argument vector \(\langle z_0, z_1, z_2, z_3\rangle \) with \(z_0, z_2, z_3 > \mathtt {THRESHOLD}\) and \(z_1 \le \mathtt {THRESHOLD}\), a mask \(\langle 1, 0, 1, 1\rangle \) is used in the asymptotic method branch, and its complement \(\langle 0, 1, 0, 0\rangle \) is used in the direct method branch. To ensure correctness with the conditional loop breaks, another mask vector, with all elements initialized to 1, is used, where an element is set to 0 when the corresponding element of the compute vectors satisfies the loop break condition; the loop break is taken when all the mask vector elements are 0. With this manual vectorization of the analytical kernel, we were able to achieve speedups of more than 1.5\(\times \) with respect to the baseline. The performance of all the versions of this kernel, vectorized using Intel MKL and with AVX intrinsics, is summarized in Fig. 3.

Algorithm 3. Basic structure of the scalar \(J_1\) function, highlighting its branching and conditional loop breaks.
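The masking described above can be sketched with AVX compare, blend, and move-mask intrinsics as follows; this is an illustrative fragment with stand-in arithmetic and a hypothetical threshold value, not the actual \(J_1\) code.

```cpp
#include <immintrin.h>

// Illustrative masking fragment: lanes with |z| > THRESHOLD take the
// "asymptotic" path, the remaining lanes take the "direct" path, and a
// move-mask test implements the all-lanes-converged loop break.
void masked_select_example(__m256d z_abs, __m256d& result) {
  const __m256d threshold = _mm256_set1_pd(8.0);              // hypothetical THRESHOLD
  __m256d mask = _mm256_cmp_pd(z_abs, threshold, _CMP_GT_OQ); // all-ones where z > threshold

  __m256d asym   = _mm256_mul_pd(z_abs, _mm256_set1_pd(0.5)); // stand-in arithmetic
  __m256d direct = _mm256_add_pd(z_abs, _mm256_set1_pd(1.0)); // stand-in arithmetic

  // per-lane select: take 'asym' where the mask is set, 'direct' elsewhere
  result = _mm256_blendv_pd(direct, asym, mask);

  // loop-break pattern: each set bit marks a lane that still needs iterations,
  // so the series loop terminates once the mask becomes zero
  int active_lanes = _mm256_movemask_pd(mask);
  (void)active_lanes;   // e.g. while (active_lanes != 0) { ... update mask ... }
}
```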

5.5 Vectorizing Numerical Form Factor Kernel

Due to the presence of an additional data dimension in the numerical form factor kernels, the vectorization approach used here is quite different from the one used for the analytical kernels. A single triangle is described by its three vertices, each with three coordinates, amounting to nine real numbers. Constructing vectors with the previous approach would be quite inefficient in this case due to the excessive padding needed to describe a triangle using vectors. Hence, as seen in Algorithm 2, it is most efficient to vectorize at the level of the loop over the triangles. The AoSoA model is followed to encode the triangles into a set of vector triangles, as shown in Fig. 4. A vector triangle thus encodes four triangles without the need for any padding. The array of triangles is converted into an array of vector triangles, with padding possibly needed only for the last vector in the array.

Fig. 4. AVX vectorized representation of triangles. Each triangle is described by its three vertices (right), and each vertex is described by its three coordinate values (left). Such an AVX vector triangle structure holds the information of four triangles in double precision.
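A hedged reconstruction of the vector-triangle layout in Fig. 4 is shown below; the field names are illustrative.

```cpp
#include <immintrin.h>

// Each __m256d packs the same coordinate of the same vertex from four
// different triangles (AoSoA), so a vector triangle holds four triangles
// with no padding inside the structure.
struct avx_vertex {
  __m256d x, y, z;           // one coordinate triple across four triangles
};

struct avx_triangle {
  avx_vertex v0, v1, v2;     // three vertices, nine __m256d values in total
};
```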

Table 2. Performance of the AVX vectorized numerical form factor kernel, given as execution times and speedups with respect to the base (non-vectorized) kernel on Cori-1. The shape structure used for this set of experiments consists of 2,280 triangles. Performance is shown for three different Q-grid sizes.

We summarize the performance results of the final vectorized numerical kernel compared to the initial version in Table 2. The hand-vectorized version achieves speedups of over 4.5\(\times \) in most cases, up to a maximum of 5.2\(\times \), over the initial baseline version, showing that this vectorization approach is highly effective.

6 Roofline Performance Modeling

In order to analyze the performance of the final optimized kernels with threading and vectorization, we revisit the Roofline performance model mentioned earlier in Sect. 3. In Roofline performance modeling, the target is to achieve performance as close to the roofline as possible, depending on the arithmetic intensity of a given kernel. Since arithmetic intensity is defined as the ratio of FLOPs executed to bytes moved, different bandwidth bounds can be shown with respect to the levels of the memory hierarchy. We utilize the Intel SDE to obtain the FLOP counts and L1-cache data traffic of the kernels. It should be noted that the L1-cache data accesses are generally significantly higher than the off-chip data movement (DRAM). In Fig. 5, the performance of the analytical cylinder form factor kernel is shown. It can be observed that the use of Intel MKL vector functions significantly decreased the kernel performance compared to the baseline, while the hand-vectorized version shows improved performance. The performance of the various versions is obtained with the same thread concurrency on the given systems: 24 threads on Edison and 32 threads on Cori-1 (one thread per core). Similarly, in Fig. 6, we show the performance of the numerical form factor kernel. In this case, the performance improvement with the hand-vectorized version is significantly higher with respect to the baseline kernel. Here also, the performance of the two versions is obtained with the same thread concurrency.

Fig. 5. The performance of the analytical cylinder form factor kernel plotted on the Roofline model for the two systems, Edison (left) and Cori-1 (right). The baseline (non-vectorized) version and the various vectorized versions, Intel MKL with VML modes HA, LA and EP, and AVX intrinsics, are shown for comparison.

Fig. 6. Performance of the numerical form factor kernel with respect to the Roofline model for the two systems, Edison (left) and Cori-1 (right), with the baseline and the AVX-intrinsics vectorized versions.

Fig. 7. Roofline performance modeling of the analytical cylinder (left) and the numerical (right) form factor kernels on the Intel Knights Landing processor. Performance of the baseline and optimized versions is shown with respect to the L1-cache and MCDRAM memory bandwidths. The use of MCDRAM is highly beneficial for these kernels, increasing the arithmetic intensities significantly.

With the increase in the number of compute cores on emerging architectures, we analyze the performance of the two kernels under study on an Intel Knights Landing processor testbed. The corresponding results plotted on the Roofline model are shown in Fig. 7. On this processor, we utilize the available high-bandwidth MCDRAM for the entire runs, since the complete working set for these runs fits into this memory. Use of this high-bandwidth memory significantly improves the arithmetic intensities, opening up even more room for computational optimizations. Note that the vectorized version used in these experiments is the AVX2 version developed above, with 256-bit wide vectors. Given the available room for performance improvement, the use of 512-bit wide vectors on the KNL is expected to increase the performance further and is the next step in our work. Nonetheless, the current vectorized version on the KNL already shows up to 1.2\(\times \) the performance of a Cori-1 node.

7 Conclusions and Further Discussions

Although modern processors pack large performance potential, exploiting them to attain high parallel efficiency for scientific applications still remains a challenge. In this work, we primarily considered two levels of parallelism, SMT and SIMD, to deliver higher performance with a real-world scientific application code. Even though current compiler technology is able to significantly optimize code performance, there remains a large class of computations for which it fails. Our application code has such computational kernels, where the compilers are unable to optimize them sufficiently for effective exploitation of on-chip parallelism. In such cases, implementations with low-level vector intrinsics are unavoidable if high efficiency is desired.

The work presented in this paper is highly applicable to near-future processor architectures, because the two levels of parallelism addressed here are among the primary means by which these new processors target higher performance. For example, Intel’s new Knights Landing manycore processor has a much higher number of compute cores (\(\ge \)64). Similarly, the vector units on these processors are 512-bit wide with the AVX512 instruction set, doubling the width of the AVX2 vector units on typical Intel Xeon processors; effective vectorization techniques, such as those presented in this work, are therefore necessary.