1 Introduction

The matrix-vector product is an important numerical kernel as well as one of the 7\(+\) dwarfs [3] proposed for the evaluation of parallel programming models and architectures. In particular, the sparse instance of the matrix-vector multiplication (SpMV) underlies the HPCG benchmark [8], and is also a crucial kernel for the solution of linear systems and eigenvalue problems arising in many scientific and engineering applications [21]. Furthermore, the connection between sparse linear algebra and graph algorithms has recently been exploited by a new class of algorithms, based (among other kernels) on SpMV, to tackle the vast volumes of information that are common in social networks and other data analytics processes [5, 14].

In this paper, we perform a complete experimental analysis of a generic implementation of the matrix-vector product on a current multi-threaded architecture with support for dynamic voltage-frequency scaling (VFS). Our study focuses on energy efficiency, using the energy-per-flop metric as reference. This analysis is timely because, even though the matrix-vector product kernel has been extensively analyzed and optimized from the point of view of performance, the number of studies that investigate its energy consumption is limited. This is especially relevant since power and energy are factors that constrain the performance of current processor designs for the high performance computing systems running the aforementioned numerical and graph-related applications [9, 17].

In particular, this work makes the following specific contributions with respect to the energy optimization of SpMV:

  • We analyze several critical parameters, related to the matrix size, sparsity degree, and non-zero clustering of the sparse matrix, that drive the energy efficiency of this kernel on a modern multithreaded architecture.

  • We derive a simple recipe to optimize VFS for an SpMV operation involving any given sparse matrix, which exploits the aforementioned relevant parameters of the graph representing the sparse matrix.

  • We demonstrate the robustness of our VFS optimization recipe by applying it to optimize the energy efficiency of SpMV for the entire University of Florida Sparse Matrix Collection (UFMC) [1] on the IBM POWER8.

  • We evaluate the energy gains that an appropriate adaptation of VFS can yield with respect to an energy-oblivious approach that considers performance as the only optimization goal.

Our driving motivation is the design of an easy-to-use and widely-applicable strategy to significantly reduce the energy consumption of SpMV without the need for a per-case analysis. Indeed, finding the optimal VFS configuration for each individual sparse problem is both unrealistic and unaffordable in practical production scenarios, where the sparsity pattern may change from one application to the next and time cannot be spent on preliminary energy-benchmark fine-tuning.

The rest of the paper is structured as follows. In Sect. 2 we briefly review some related work and next, in Sect. 3, we describe the experimental setup. In Sects. 4 and 5 we analyze the energy consumption of the CSR-based matrix-vector kernel with dense and sparse matrices, respectively. We close the paper with a few concluding remarks in Sect. 6.

2 Related Work

There exists a large volume of work addressing the performance optimization of SpMV; see, e.g., [5, 16, 22, 23] and the references therein. Among others, pOSKI [6] is a multithreaded library for SpMV that leverages automatic search over multiple implementations and sparse data layouts to optimize performance on multicore processors. Model-driven optimization of SpMV for data-parallel accelerators has been studied in [7, 10].

The number of efforts dedicated to the energy modeling and optimization of SpMV is significantly smaller. In [20] we introduced a systematic methodology to derive reliable time and power models for algebraic kernels employing a bottom-up approach. However, the recipe turned out to be somewhat cumbersome to apply, requiring a large number of calibration tests. In [19] we devised a systematic machine learning algorithm to classify and predict the performance and energy costs of the SpMV kernel, but did not consider the effect of VFS. The work in [2] presents an extensive experimental study of the interactions in the performance-power-energy triangle for the execution of a pivotal numerical algorithm, the iterative Conjugate Gradient (CG) method, on an ample collection of parallel multithreaded architectures. However, that work does not produce a recipe to optimize VFS for any given sparse problem. Other work related to modeling sparse linear algebra operations can be found in [11, 13, 18].

3 Experimental Setup

3.1 Hardware

The target platform for our analysis is the IBM Power System S812L. The POWER8 processor in this system, fabricated in 22 nm silicon technology, comprises 2 sockets, each with 6 cores, offering hardware support for up to 8-way simultaneous multi-threading (SMT) as well as dynamic VFS. Each core of the IBM POWER8 is furnished with a 512-KB L2 cache. Furthermore, the chip contains a shared L3 cache of 8 MB per core, and 16 MB of L4 cache per memory buffer, with up to 8 buffers per socket [4]. The server was also equipped with 64 GB of DDR RAM.

Table 1 displays the frequency-voltage configuration pairs and the idle power dissipated by the system, measured during the execution of a sleep test, for two scenarios: socket+DDR (“SD”) only vs the full server (“Node”). Power measurements were obtained using AMESTER [15]. This tool runs on a separate server and connects to the service processor of the node in order to obtain voltage/frequency/power-per-core samples from several sensors, while avoiding interference with the workload. Using this information, we calculate the time-per-flop and the net power-per-core (without the idle power), and obtain the net energy-per-flop-and-core as their product. All energy consumption values reported next refer to the net energy-per-flop-and-core.
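As a minimal sketch of how this metric is derived (the identifiers below are ours, not AMESTER's; AMESTER only supplies the raw power samples), the computation combines the measured average power, the idle power of Table 1, the elapsed time, the flop count, and the number of active cores:

/* Hypothetical helper: net energy-per-flop-and-core from the measured data. */
static double net_energy_per_flop_core(double avg_power_w,  /* average power during the run (W) */
                                       double idle_power_w, /* idle power from Table 1 (W)      */
                                       double elapsed_s,    /* wall-clock time of the run (s)   */
                                       double flops,        /* floating-point operations        */
                                       int    cores)        /* active cores (here, 6)           */
{
    double time_per_flop  = elapsed_s / flops;                    /* s per flop          */
    double net_power_core = (avg_power_w - idle_power_w) / cores; /* net W per core      */
    return time_per_flop * net_power_core;                        /* J per flop and core */
}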

The codes were compiled using IBM’s mpcc 13.01.0003.0000, with the flags: -O3 -qprefetch=dscr=0 -qhot -qstrict -qsmp=noauto:omp -qthreaded -qsimd=auto -qaltivec -q64 -qarch=pwr8. Each test was repeated for at least 60 s, and the results average the values from these runs. The experiments targeted a single “socket” of the IBM POWER8 chip (i.e., 6 cores), with either 1 thread or 8 threads per core (1-SMT or 8-SMT, respectively), and all threads/cores collaborating to compute one instance of SpMV.

Table 1. Voltage-frequency pairs and idle power in the IBM POWER8 processor.

3.2 Kernel and Implementation

We analyze an implementation of the SpMV \(y:=A\cdot x\), with sparse matrix \(A\in \mathbb {R}^{n\times n}\) and dense vectors \(x,y\in \mathbb {R}^n\), based on the CSR (compressed sparse row) storage format [21]. For sparse matrices, CSR offers a fair balance between compression efficiency (as it is one of the most efficient formats for generic sparse matrices on cache-based microprocessors) and architecture-independent performance (since it does not directly exploit graph characteristics that may emerge from the specific physical problem) [22]. The CSR data layout employs a real array for the values of A (A_val), and two auxiliary integer arrays (col_ind and row_ptr) to maintain, respectively, the column indices of the nonzero entries in A and the initial/final index of each row of A within the other two arrays [21]. All our experiments employ double precision floating-point arithmetic, so that the values of A, x, and y occupy \(s_d=8\) bytes each. Each component of the indexing integer arrays occupies \(s_i=4\) bytes. Therefore, storing an \(n \times n\) matrix with \(n_z\) nonzero entries in this format requires \(M_S=n_z(s_d+s_i)+(n+1)s_i\) bytes, and x, y occupy \(V_S=ns_d\) bytes each.

The implementation of SpMV in CSR format is illustrated in Fig. 1. The optimization is left to the compiler, except for some minor details omitted for simplicity. The parallelization strategy distributes the computation of the entries of y among the threads/cores (via the OpenMP #pragma omp directive before the outer loop).

Fig. 1. SpMV based on the CSR format.
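Since Fig. 1 is not reproduced here, the following is a minimal C/OpenMP sketch of the CSR kernel it illustrates, using the array names A_val, col_ind and row_ptr introduced above; the static schedule and the exact loop structure are assumptions, and compiler-specific tuning is omitted.

/* y := A*x with A in CSR format; n rows, row_ptr has n+1 entries that
 * delimit each row inside A_val and col_ind. */
void spmv_csr(int n, const int *row_ptr, const int *col_ind,
              const double *A_val, const double *x, double *y)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++) {            /* rows of y distributed among threads     */
        double yi = 0.0;                     /* y[i] accumulated in a register          */
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            yi += A_val[k] * x[col_ind[k]];  /* A_val/col_ind streamed with unit stride */
        y[i] = yi;
    }
}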

In the operation \(y:=A\cdot x\), there is no reuse of the entries of A and the only opportunity to exploit data locality is in the accesses to x and y. In the CSR version of SpMV, the entries of A_val and col_ind are streamed from the memory layer where they reside into the processor register file with unit stride; each entry of y is loaded into a register once and re-used until it has been completely updated; and the re-use factor of x depends on the sparsity pattern of A.

4 Tuning VFS for the Dense Matrix-Vector Product

We commence our analysis by considering an \(n \times n\) dense matrix-vector product kernel, GeMV, computed via the code in Fig. 1. While a dense matrix can be more efficiently stored as a conventional 1-D array, in column- or row-major order, this initial study will offer us some preliminary insights on the energy behavior of this memory-bound operation. For the following experiments, we use two square dense matrices, of dimension n = 312 and n = 30,512 (with \(n_z=n^2\) nonzeros). Taking into account the CSR memory layout, and the fact that all cores collaborate in the execution of the same matrix-vector product, the data for these two problems requires about 1.15 MB and 10.6 GB, respectively. Thus, the small case easily fits into the on-chip L3 cache (8 MB/core), while the larger problem can only be stored in the off-chip DDR RAM.

Scaling the voltage and frequency (VFS) can be expected to produce an effect on performance and power dissipation which, in turn, produces an impact on the energy consumption. In principle, one could expect that a change of frequency results in a proportional variation of the GFLOPS (billions of flops per second). The left-hand side plot in Fig. 2 investigates the behaviour as the frequency is increased, and the socket is populated with 1 or 8 threads per core (1-SMT and 8-SMT). To capture the theoretical linear relation between the GFLOPS and the frequency, both metrics are normalized in the figure with respect to those observed for \(C_1\). On one hand, when operating with the off-chip problem, the GFLOPS rate attained with 1-SMT stagnates for the two higher frequency rates while, for 8-SMT, the performance does not vary with the frequency. These results expose scenarios where the DDR bandwidth is saturated for our GeMV code. On the other hand, the GFLOPS rate grows linearly with the frequency for the L3 on-chip case, independently of the number of threads.

Fig. 2. Performance (left) and SD power consumption (right) for GeMV.

The analysis from the point of view of power dissipation is more complex. Concretely, for a given voltage-frequency configuration pair \(C_{\small {\text{ A }}}=(V_{\small {\text{ A }}},F_{\small {\text{ A }}})\), the power dissipation can be decomposed into its static and dynamic components which depend, respectively, on \(V_{\small {\text{ A }}}^2\) and \(V_{\small {\text{ A }}}^2\cdot F_{\small {\text{ A }}}\) [12]. We can assume that the idle power (see Table 1) is mostly due to leakage (static power), while a substantial fraction of the net power is due to the application’s activity (dynamic power). The right-hand side plot in Fig. 2 compares the experimental net power ratio \(P^{\tiny {{\textsc {net}}}}_{\small {\text{ A }}}\), normalized with respect to that observed when running in \(C_1\), against the theoretical VFS ratio \((V_{\small {\text{ A }}}^2\cdot F_{\small {\text{ A }}})/(V_1^2\cdot F_1)\) during the execution of GeMV. Considering only the 8-SMT cases, there is a clear difference between the L3 case, which shows a perfect match between the theory and the experimental behaviour, and the DDR problem, for which the net power grows at a much lower pace due to the memory bottleneck.
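For reference, the theoretical ratio plotted in Fig. 2 (right) follows from the standard simplified power model sketched above, where \(c\) denotes an activity-dependent constant (an assumption of the model, not a measured quantity):

$$ P^{\tiny {{\textsc {net}}}}_{\small {\text{ A }}} \approx c\cdot V_{\small {\text{ A }}}^2\cdot F_{\small {\text{ A }}} \quad \Longrightarrow \quad \frac{P^{\tiny {{\textsc {net}}}}_{\small {\text{ A }}}}{P^{\tiny {{\textsc {net}}}}_{1}} \approx \frac{V_{\small {\text{ A }}}^2\cdot F_{\small {\text{ A }}}}{V_1^2\cdot F_1}. $$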

Table 2 illustrates the combined effect of the variations in performance and power dissipation on the energy consumption, identifying the best VFS configuration depending on the problem dimension. The values there are again normalized with respect to those observed for the configuration \(C_1\). Hereafter, we only report the results obtained with 8-SMT, as they are consistently superior to those obtained with 1-SMT in both performance and energy efficiency.

Table 2. Normalized energy consumption with respect to \(C_1\) for 8-SMT and different VFS levels. Best choices are highlighted in green.

The previous experiments reveal that the execution time as well as the power consumption of GeMV can be accurately modeled when the problem data fits on-chip. This leads to a straightforward derivation of energy-related metrics (such as our target net energy-per-flop) and, as illustrated in Table 2, paves the way to a direct optimization of VFS from the perspective of energy efficiency. The behavior of performance and power consumption is more difficult to predict for the DDR problems; however, if the goal is to optimize energy efficiency, the best strategy for GeMV is simply to run this kernel at the lowest voltage-frequency pair. Compared with this energy-aware VFS configuration, an execution of GeMV that simply aims to enhance performance consumes significantly more energy: 1.43\(\times \) for SD and 1.32\(\times \) for the node.

5 Tuning VFS for the Sparse Matrix-Vector Product

Unfortunately, the sparsity exhibited by most real applications generally results in irregular access patterns (in CSR, to the entries of x), which may produce an imbalanced workload distribution, rendering the guidelines derived to adjust VFS for GeMV in Sect. 4 suboptimal for the sparse case.

In [19], we identified a reduced set of critical structural parameters which impact the performance, power, and energy consumption of the CSR implementation of SpMV. These properties were used to generate a small synthetic sparse benchmark (or training set) which was then employed to build a model that accurately predicts performance and energy consumption of any sparse problem. In the following, we investigate whether the same approach, based on a synthetic sparse training set, can produce a strategy to select a (close-to-)optimal VFS configuration for any sparse problem.

5.1 Training Set

In order to determine an appropriate configuration for SpMV, we leverage the benchmark introduced in [19], which is characterized by five parameters:

  • Number of rows/columns n and nonzeros \(n_z\).

  • Block size: \(b_s\). Many applications lead to sparse matrices where the non-zeros are clustered into a few compact dense blocks in each row. This parameter specifies the number of entries in these blocks, and determines the number of elements of x accessed with unit stride.

  • Block density: \(b_d= b_s/n_{zr}= b_sn/n_z \in [0,1]\) is the inverse of the number of blocks per row. With \(b_s\) fixed, \(b_d\) defines the re-use factor for y.

  • Row density: \(r_d= n_{zr}/n \in [0,1]\) is the number of non-zeros per row relative to the row size. With \(n_{zr}\) fixed, \(r_d\) is an indicator of the probability of finding an entry of x already fetched into a higher level of the cache hierarchy during the computation of a previous entry of y.

For the analysis, we distribute the matrix instances of the training set evenly in the \(\log _2\) space comprised by \(b_s\in \{2^0,2^2,2^4,\ldots ,2^{14}\}\), \(b_d\in \{2^{0},2^{-2},\ldots ,2^{-14}\}\), and \(r_d\in \{2^{0},2^{-2},\ldots ,2^{-28}\}\). Thus, for a matrix in the training set, the triplet-coordinates \((b_s,b_d,r_d)\) identify a matrix of dimension \(n=n_{zr}/r_d= b_s/(b_dr_d)\) with \(n_z=n\,n_{zr}\) nonzeros. As \(b_s\le n_{zr}\le n\), this distribution yields a total of 162 samples only, yet offers enough variability to characterize sparse matrices from real applications while avoiding a costly calibration.
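For illustration, consider the triplet \((b_s,b_d,r_d)=(2^4,2^{-2},2^{-10})\) from this grid (the particular choice is ours):

$$ n_{zr} = \frac{b_s}{b_d} = 2^{6} = 64, \qquad n = \frac{n_{zr}}{r_d} = 2^{16} = 65{,}536, \qquad n_z = n\,n_{zr} = 2^{22}. $$

That is, a \(65{,}536\times 65{,}536\) matrix with 64 nonzeros per row, clustered into \(1/b_d=4\) dense blocks of \(b_s=16\) entries each.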

Figure 3 illustrates a compact representation of the training set, with each matrix identified by a single point \((b_s,b_d,r_d)\) in the 3-D space. The colors of the points identify the optimal VFS configuration, from the perspective of energy consumption, taking into account two scenarios: SD power only or the full node consumption. The matrix instances in the training set are divided into three categories, according to the size of the SpMV data (\(P_S=M_S+2V_S\)): “L3”, “DDR”, and “transition”. The first category includes those instances where the matrix data occupies less than 18 MB and involves small vectors x, y that easily fit into the L3 cache; the second category contains those instances requiring more than 200 MB, which therefore can only reside in the DDR; all the remaining cases are assigned to the transition category.
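A minimal sketch of this size-based classification, using the storage formulas of Sect. 3.2 (the function and type names are ours, and the decimal interpretation of MB is an assumption):

typedef enum { CAT_L3, CAT_TRANSITION, CAT_DDR } spmv_category;

/* Classify an SpMV instance from its CSR footprint:
 * M_S = nz*(s_d+s_i) + (n+1)*s_i bytes, V_S = n*s_d bytes, P_S = M_S + 2*V_S. */
spmv_category classify(long long n, long long nz)
{
    const double MB = 1e6;
    double MS = nz * 12.0 + (n + 1) * 4.0;  /* sparse matrix storage   */
    double VS = n * 8.0;                    /* one dense vector        */
    double PS = MS + 2.0 * VS;              /* total SpMV data set     */

    if (MS < 18.0 * MB)  return CAT_L3;     /* matrix data below 18 MB */
    if (PS > 200.0 * MB) return CAT_DDR;    /* data set above 200 MB   */
    return CAT_TRANSITION;                  /* everything in between   */
}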

Fig. 3. Optimization of SD (left) and node (right) energy consumption via VFS for the synthetic training set, with matrices in the L3, DDR, and transition categories (top, middle, and bottom, resp.). The color of each point identifies the energy-optimal VFS configuration.

These results show that, for any scenario/category, there is no single configuration that optimizes VFS for all the cases in it. For brevity, let us examine the energy consumed in the SD scenario in some detail (left column of plots):

  • L3: in most cases, the \(C_5\) configuration is optimal.

  • DDR: the \(C_1\) configuration is possibly the best compromise solution. However, we can observe that, as the row density \(r_d\) decreases, the \(C_2\) configuration seems to offer a fair alternative.

  • Transition: the figures expose no clear winner in this case.

The plots in Fig. 3 offer a quick glance at the impact of VFS depending on the scenario (SD or Node) and category (L3, DDR or transition). We next refine this information to quantify the overhead incurred by an approach that chooses a single configuration for each scenario/category combination. Consider in particular a category consisting of s problem instances \(P=\{p_1,p_2,\ldots ,p_s\}\), which are executed under a given VFS configuration \(C_{\small {\text{ A }}}\) and scenario S, with the resulting energy consumption denoted as \( E^S \in \{E^{\mathrm{SD}},E^{\mathrm{Node}}\}. \) Let us denote by \(C_{{{\textsc {Opt}}}}(p,S)\) the configuration that minimizes the energy consumption for a given problem instance/scenario pair p / S. With these premises, Table 3 reports the relative deviation from the energy-optimal VFS configuration for each problem category and scenario, given (in %) by

$$ \frac{100}{s} \cdot \sum _{i=1}^s \left( \frac{E^S(p_i,C_{\small {\text{ A }}}) - E^S(p_i,C_{{{\textsc {Opt}}}}(p_i,S))}{E^S(p_i,C_{{{\textsc {Opt}}}}(p_i,S))}\right) . $$

The “L3”, “DDR”, and “Trans(ition)” columns in the table thus show the additional energy (overhead in %) that is spent if one chooses a single configuration for all problems in the category, instead of the specific optimal configuration for each problem instance/scenario. Let us analyze one of the scenarios in more detail, namely the SD energy consumption. The results in the table reveal that this excess is acceptable for the instances in the L3 and DDR categories when choosing \(C_5\) (0.9 %) and \(C_1\) (2.4 %), respectively, but they also expose a significant loss if a single configuration is selected for all cases in the transition category (15.3 % in the best case, corresponding to \(C_2\)).
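For concreteness, the percentage reported in Table 3 can be computed as in the following sketch (the function name is ours; e_A and e_opt hold the measured energies \(E^S(p_i,C_{\small {\text{ A }}})\) and \(E^S(p_i,C_{{{\textsc {Opt}}}}(p_i,S))\)):

/* Average relative deviation (in %) of configuration C_A from the
 * per-instance optimal configuration, over the s instances of one
 * category/scenario, following the formula above. */
double vfs_overhead_percent(const double *e_A, const double *e_opt, int s)
{
    double acc = 0.0;
    for (int i = 0; i < s; i++)
        acc += (e_A[i] - e_opt[i]) / e_opt[i];
    return 100.0 * acc / s;
}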

Table 3. Deviation (in %) of energy consumption with respect to the optimal configuration for the synthetic training set. Best choices are highlighted in green.

Motivated by the blurry behaviour of the instances in the transition category, we next analyze these cases in more detail. Table 4 reports the SD energy consumption for the 21 instances in this problem category, normalized with respect to the configuration \(C_1\). The values in the table show no clear relation between the problem size, defined by n and \(P_S\), and the optimal configuration for these instances. On the other hand, among the remaining three problem parameters, only the row density seems to play a role, as high row densities (HRD: \(r_d > 2^{-14}\)) favour high frequencies, while low row densities (LRD: \(r_d \le 2^{-14}\)) benefit more from a lower frequency. This criterion leads us to split the transition category into two subcategories, LRD and HRD, yielding the deviations in the columns labelled as “HRD” and “LRD” of Table 3, which reveal a significant improvement for the LRD subcategory but no relevant gain for the HRD one. To conclude this analysis of the transition cases, we note that LRD consists of just 6 instances, which may be too small a sample to derive a strong conclusion.

Table 4. Normalized energy consumption of SD with respect to \(C_1\) for 8-SMT and different VFS levels. Best choices are highlighted in green.

5.2 Validation with UFMC

The synthetic benchmark is only a small collection of sparse problems that aims to provide a rough approximation of the sparsity patterns present in real applications, in order to offer some guidance on the energy-optimal VFS configuration. The motivation for this is that choosing the optimal VFS for each individual sparse problem can be unrealistic, as that may require executing each case at each VFS level to make an appropriate choice. As an alternative, we trade off accuracy (and, therefore, energy efficiency) for flexibility, by using the classification and VFS configurations obtained with the synthetic benchmark to dictate the selection for the real applications.

We validate the energy efficiency that can be attained if we base our VFS selection for real sparse matrices on the previous problem classification. For this purpose, we employ 1,202 problem instances from very different real applications in the UFMC set. Among these, 1,044 cases fit into the L3 cache, 75 can be classified as DDR cases, and the rest belong to the transition category, with 69 in HRD and only 14 in LRD.

Table 5. Deviation (in %) of energy consumption and performance with respect to the optimal configuration for the UFMC. Best choices are highlighted in green.

Table 5 displays the relative deviation from the energy-optimal VFS configuration and from the performance-optimal configuration for the problem instances in the UFMC. Regarding energy, when comparing the optimal global VFS configuration in the table with those in Table 3, we observe that the training set did actually offer appropriate guidance to select the energy-optimal VFS for the L3, DDR and HRD cases, for both the SD and node scenarios. On the other hand, the global energy-optimal VFS options for the UFMC problem instances in the LRD category are \(C_5\) (SD) or \(C_6\) (node) instead of those pointed out by the synthetic benchmark. With respect to performance, applying the energy-optimal VFS decreases this metric by up to 11.63 %, except for the LRD cases, where the loss can reach 37.44 %. Again, this is due to the fact that the training set prediction does not match the energy-optimal VFS for the UFMC cases that fall into the LRD category. At this point, it is worth remembering that the synthetic collection included only 6 problem instances in this category, with the data for five of them occupying in the range of (\(P_S\)=) 53–68 MB. Compared with this, the UFMC set has 14 instances in the LRD category, but only 4 of these are in the same \(P_S\)-dimension range. The fact that the samples in the (synthetic and real) LRD category are few, and that the problem sizes for the synthetic set and the UFMC do not overlap, explains the different energy-optimal VFS configuration determined for each case. However, we emphasize that the LRD category contains only 14 cases out of 1,202 real problems, which is less than 1.2 %! For the remaining 98.8 % of the cases, the training set did actually identify a fair classification into categories as well as offer a good VFS selection.

A final question to investigate is the balance between the energy gains and the performance loss that an energy-aware VFS configuration, based on the categories/VFS levels determined from the previous experimental studies, can yield compared with a conventional performance-oriented VFS selection that simply runs all (real) cases at the highest VFS level. Table 6 shows that, when considering the socket+DDR (SD), an energy-aware VFS configuration can yield savings between 3.0 % and 21.0 % with respect to the performance-oriented option, with smaller savings if we consider the full node consumption. These savings come at a certain cost from the perspective of execution time, with the energy-aware VFS configuration incurring a loss between 9.6 % and 22.7 % with respect to the performance-oriented case.

Table 6. Average energy savings vs performance loss between the energy-optimal and the performance-optimal VFS configurations for the UFMC (denoted as \(C_E\) and \(C_P=C_6\), respectively). From previous experiments with UFMC, \(C_E=C_5\) for the problem instances in the L3 and transition categories, while \(C_E=C_1\) for the DDR cases. Performance is measured in time-per-flop.
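Putting the pieces together, the energy-aware selection used in Table 6 can be summarized as a simple decision rule (a sketch under our own naming, reusing the hypothetical classify() routine from Sect. 5.1; \(C_1\) is the lowest and \(C_6\) the highest voltage-frequency pair):

typedef enum { C1 = 1, C2, C3, C4, C5, C6 } vfs_config;

/* Energy-aware VFS recipe following Table 6: C5 for the L3 and transition
 * categories, C1 for the DDR cases (the performance-oriented choice is C6). */
vfs_config select_vfs_energy(long long n, long long nz)
{
    switch (classify(n, nz)) {
        case CAT_DDR: return C1;   /* memory-bound: lowest voltage-frequency pair */
        default:      return C5;   /* L3 and transition instances                 */
    }
}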

6 Concluding Remarks

Voltage-frequency scaling (VFS) is an energy-oriented technology present in current hardware that the operating system/programmer can leverage to adapt the execution pace of an application without modifying the code. Unfortunately, selecting the energy-optimal VFS configuration is both architecture- and application-dependent. For the (sparse) matrix-vector product kernel, our work shows that it is possible to rely on a portable benchmark, consisting of a reduced number of synthetic sparse matrices, to establish a general classification of the problem data (according to criteria related to problem dimension and sparsity pattern), and to determine a global energy-optimal VFS configuration for the matrices in each group. Our experiments on a multicore server equipped with an IBM POWER8 show a strong dependence between energy consumption and problem dimension, exposing an interesting trade-off between energy efficiency and performance for this particular kernel.

Our work also analyzed the energy-delay product, reaching conclusions similar to those presented in the paper for the energy efficiency. As part of future work, we plan to investigate the energy savings that can be attained with a limited loss in performance, as well as the prediction of the optimal level of concurrency throttling from the point of view of energy efficiency.