Abstract
We introduce a systematic analysis to fuse CUDA kernels arising in efficient iterative methods for the solution of sparse linear systems. Our procedure characterizes the input and output vectors of these methods and combines this information with a dependency analysis to decide which kernels to merge. The experiments on a recent NVIDIA “Kepler” GPU report significant gains, especially in energy consumption, for the fused implementations derived from the application of the methodology to three of the most popular Krylov subspace solvers with/without preconditioning.
1 Introduction
The solution of sparse linear systems [12] is a ubiquitous problem in ranking and search methodologies for the web, boundary value problems and finite element models for partial differential equations, economic modeling, and information retrieval, among others. The relevance of these applications has given rise to a very large number of sophisticated sparse matrix storage layouts, libraries and algorithms for general-purpose processors (CPUs); see, e.g., [1, 7, 8, 15]. NVIDIA also supports the solution of sparse linear systems on graphics processors (GPUs), via the libraries CUBLAS and cuSPARSE, which respectively contain (CUDA) GPU kernels operating on vectors and sparse matrices.
Despite the importance of energy consumption [9, 11], few analyses of sparse linear algebra operations focus on this metric [3]. One particular source of energy inefficiency during the execution of an iterative solver [12] on a heterogeneous CPU-GPU server is that, when implemented via calls to the GPU kernels in CUBLAS/cuSPARSE, the CPU thread in control of the GPU repeatedly invokes fine-grain CUDA kernels of low cost and, therefore, short duration. Even if the solver avoids most data transfers between (the memories of) CPU and GPU, this continuous stream of kernel calls often prevents the CPU from entering an energy-efficient C-state. In [2] we introduced the fusion of GPU kernels as a means to avoid this power-hungry scenario, for the particular case of the conjugate gradient (CG) method [12]. The results in that work report significant energy gains combined with a slight improvement in performance on a platform equipped with an Intel i7-3770K plus an NVIDIA “Fermi” GTX480 board. In this paper we make the following major contributions:
-
We evolve [2] into a systematic analysis of the fusion of GPU kernels arising in a representative collection of sparse linear solvers: CG, BiCG and BiCGStab [12], including Jacobi-based preconditioned versions of these.
-
We include three alternative implementations (scalar CSR, 2-D vector CSR and ELL [6]) for the sparse matrix-vector multiplication (SpMV), with different properties/characterizations that impact the fusion opportunities within the corresponding solvers.
-
We experimentally demonstrate the benefits of kernel fusion on a platform comprising an Intel Core i7-3770K plus an NVIDIA “Kepler” K20c GPU.
The rest of the paper is structured as follows. In Sect. 2 we briefly review related work on the fusion of GPU kernels. In Sect. 3 we present the iterative solvers targeted in our work, identifying the mathematical operations that are implemented as CUDA kernels. Furthermore, we provide a systematic characterization of these GPU kernels, defining the properties that allow the fusion of two (or more) kernels. Finally, in Sects. 4 and 5 we respectively evaluate the new merged iterative solvers and discuss the conclusions from this work.
2 Related Work
Kernel fusion has received considerable attention in the past as an optimization technique that, e.g., increases memory locality, lowers overhead by eliminating multiple kernel calls, and enlarges the space for compiler optimizations. For brevity, we next discuss a few efforts that specifically target the fusion of GPU kernels.
In [10] the authors analyze how to fuse several types of CUDA kernels (map, reduce, and combinations of these) corresponding to BLAS-1 and dense BLAS-2 operations. Our work specifically targets iterative solvers for sparse linear systems, which leads us to consider a richer set of operations, different from those in [10]. Furthermore, we break the implementation of reduction kernels into two stages so that one of them, which concentrates most of the computational work, can still be fused.
In [14] the authors study the fusion of CUDA kernels with the purpose of improving their power-energy efficiency by accommodating a higher and better balanced utilization of the GPU cores. Three classes of fusions are identified in their paper: “inner thread”, “inner thread block”, and “inter thread block”, and their effects are simulated using two general benchmarks. Our fusions correspond to the first class as, for the type of operations arising in sparse linear algebra, this option yields a fair balance of the workload. Our approach differs in that we focus on the type of kernel fusions arising in sparse linear algebra, we provide a precise characterization of the kernels arising in this domain, and we offer experimental performance and energy results.
In [13] the authors propose the fusion of CUDA kernels arising in iterative solvers for sparse linear systems to improve performance, but only consider merging kernels that provide the same functionality and have no dependencies among them. The authors of [4] apply the techniques described in [2] to the iterative solution of sparse linear systems via BiCGStab. None of these works provides a systematic characterization of the GPU kernels and the conditions that allow their fusion.
3 Systematic Kernel Fusion for Sparse Iterative Solvers
3.1 Overview of Iterative Solvers for Sparse Linear Systems
Given a linear system \(Ax=b\), where \(A \in \mathbb {R}^{n \times n}\) is sparse, \(b \in \mathbb {R}^n\) contains the independent terms, and \(x \in \mathbb {R}^n\) is the sought-after solution, iterative projection methods based on Krylov subspaces, in combination with an appropriate preconditioner, often outperform the most efficient direct solvers available today in terms of memory consumption and execution time [12].
Concerning the computational effort of iterative Krylov subspace methods, in practical applications the cost of the iteration loop is dominated by one or two SpMV involving A. Given a sparse matrix A with \(n_z\) nonzero entries, in general the cost of the SpMV is roughly \(2n_z\) floating-point arithmetic operations (flops). Additionally, the loop body contains several vector operations that require O(n) flops each.
Figure 1 offers an algorithmic description of the preconditioned BiCG method. In general, we use Greek letters for scalars, lowercase for vectors and uppercase for matrices. There, the user-defined parameter \(\tau _{\max }\) sets an upper bound on the relative residual for the computed approximation to the solution \(x_j\), and \((z_1,z_2)\) denotes the inner product (dot) of vectors \(z_1, z_2\). The method involves two SpMV as well as several BLAS-1 (vector) operations per iteration (axpy, xpay and dot). The application of the Jacobi preconditioner matrix M requires an element-wise product of two vectors.
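For illustration purposes only, the following minimal sketch (with hypothetical names, not taken from our actual implementation) shows the one-thread-per-element structure of two of these vector kernels: xpay, i.e., \(y := x + \alpha y\), and the application of the Jacobi preconditioner, which multiplies the (inverted) diagonal of A element-wise with a vector.

```cuda
// Minimal sketches of two of the vector kernels named above (hypothetical
// names; one thread per element in both cases).

// xpay: y := x + alpha*y
__global__ void xpay(int n, float alpha, const float *x, float *y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = x[i] + alpha * y[i];
}

// Jacobi preconditioner application z := M^{-1} r, with M^{-1} stored as the
// inverted diagonal d of A: an element-wise product of two vectors.
__global__ void jprec(int n, const float *d, const float *r, float *z) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) z[i] = d[i] * r[i];
}
```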
The preconditioned BiCG method in Fig. 1 contains all the GPU kernels that also appear in the preconditioned CG and BiCGStab. In the following sections we characterize these kernels according to the type of access they perform to their data/results, employ the preconditioned BiCG to present the systematic fusion of GPU kernels, and generalize these principles to other variants of BiCG as well as to other solvers.
3.2 Characterization of GPU Kernels for Sparse Iterative Solvers
A GPU kernel K performs a mapped access to a vector v if each thread of K accesses one of the elements of v, independently of other threads, and the global access is coalesced. We note that this property can be applied separately to the kernel input and output vectors. For the specific kernels identified in the sparse iterative solvers, we can then characterize their access types as shown in Table 1. For SpMV, we consider three well-known kernels/implementations [6]: scalar CSR, vector CSR and ELL.
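As a point of reference, a minimal scalar CSR kernel in the style of [6] is sketched next (our own simplification, not the code evaluated in Sect. 4). Each thread processes one row, so the write to the output vector y is mapped, whereas the indirect reads of x through the column indices are not; the vector CSR and ELL variants differ in how rows and nonzeros are assigned to threads, which changes this characterization.

```cuda
// Scalar CSR SpMV, y := A*x, in the style of [6] (a simplified sketch):
// one thread per row, so each thread produces exactly one element of y
// (mapped output), while the gathers on x are irregular (unmapped input).
__global__ void spmv_csr_scalar(int n, const int *rowptr, const int *colind,
                                const float *val, const float *x, float *y) {
  int row = blockIdx.x * blockDim.x + threadIdx.x;
  if (row < n) {
    float sum = 0.0f;
    for (int k = rowptr[row]; k < rowptr[row + 1]; k++)
      sum += val[k] * x[colind[k]];
    y[row] = sum;   // one output element per thread
  }
}
```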
3.3 Fusion of GPU Kernels
We first discuss two factors that may impact the performance that can be attained by merging two GPU kernels:
-
Grid dimensionality (1D, 2D or 3D). For kernels that operate on vectors, this parameter has little impact on the performance. Therefore, for simplicity, a practical approach is to enforce the same dimensionality for both kernels by, e.g., setting it to the higher of the two.
-
Grid dimensions (number of threads per block and number of blocks). The approach here is, for simplicity, to enforce the same grid dimensions for both kernels, and to set the dimensions to the largest values employed by any of the two kernels. However, this must be done with care, as this parameter may have a real effect on the performance of the kernels.
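A minimal sketch of this approach (our own illustration, not code from the actual solvers): the fused kernel adopts the larger launch configuration of the two original kernels, and bound checks inside the kernel make the extra threads harmless.

```cuda
// Two fused one-thread-per-element operations launched with a single grid;
// if the original kernels preferred different configurations, the fused one
// adopts the larger and relies on the bound check below.
__global__ void fused_axpy_scal(int n, float alpha, const float *x,
                                float *y, float *z) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {                    // guard for threads beyond n
    y[i] = y[i] + alpha * x[i];   // first fused operation (axpy)
    z[i] = alpha * z[i];          // second fused operation (scal)
  }
}

// Host side (hypothetical block sizes): if one kernel preferred 128 threads
// per block and the other 256, the fused kernel is launched with 256.
//   int threads = 256;
//   int blocks  = (n + threads - 1) / threads;
//   fused_axpy_scal<<<blocks, threads>>>(n, alpha, x, y, z);
```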
Fusing kernels is targeted at improving performance and/or energy consumption, but it obviously must produce the same results as a non-fused execution. Let us now elaborate on the properties that two GPU kernels, namely \(K_1\) and \(K_2\), must exhibit in order to participate in a fusion:
-
In case \(K_1\) and \(K_2\) do not share any data (i.e., are independent), they can always be merged.
-
Consider that \(K_1\) produces a result or output vector v that is also an input for \(K_2\), denoted hereafter as \(K_1 \mathop {\rightarrow }\limits ^{v} K_2\). (That is, there exists a read-after-write or RAW data dependency between \(K_1\) and \(K_2\), dictated by the type and order of shared access to vector v.) For the type of (dependent) kernels arising in the sparse iterative solvers, the fusion is possible if \(K_1\)/\(K_2\) perform a mapped access to the output/input vector v. This guarantees that (i) both kernels apply the same mapping of threads to the vector elements shared (exchanged) via registers; (ii) both kernels apply the same mapping of thread blocks to the vector elements shared (exchanged) via shared memory; and (iii) a global barrier is not necessary between the two kernels.
From the characterization in Table 1, we easily derive that axpy, xpay and JPred can always be merged with any other dependent kernel (one or more of them) of the same sort (i.e., axpy, xpay and JPred). Also, the scalar CSR and ELL versions of SpMV can be merged with any kernel of these three types that consumes the vector resulting from the product, i.e., SpMV (scalar CSR, ELL)\(\mathop {\rightarrow }\limits ^{y}K_2 \in \) {axpy, xpay, JPred} can be merged; but \(K_1 \in \) {axpy, xpay, JPred} \(\mathop {\rightarrow }\limits ^{y}\) SpMV cannot for any version of the sparse matrix-vector product.
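For illustration, a sketch of the SpMV (scalar CSR) \(\mathop {\rightarrow }\limits ^{y}\) axpy fusion follows (our own example, with hypothetical names). Since both kernels map thread i to element i of the shared vector y, each partial result can be passed through a register and consumed immediately, without a global barrier between the two operations.

```cuda
// Fused scalar-CSR SpMV and axpy:  y := A*x;  z := z + alpha*y.
// Thread i computes y[i] and consumes it directly from a register; y is still
// stored in case a later, non-fused kernel needs it.
__global__ void spmv_csr_scalar_axpy(int n, const int *rowptr,
                                     const int *colind, const float *val,
                                     const float *x, float *y,
                                     float alpha, float *z) {
  int row = blockIdx.x * blockDim.x + threadIdx.x;
  if (row < n) {
    float sum = 0.0f;
    for (int k = rowptr[row]; k < rowptr[row + 1]; k++)
      sum += val[k] * x[colind[k]];
    y[row] = sum;                   // mapped output of the SpMV
    z[row] = z[row] + alpha * sum;  // axpy consumes the register value
  }
}
```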
The reduction kernel dot is a special case that needs a tailored implementation so that it can be efficiently merged in \(K_1\mathop {\rightarrow }\limits ^{y}\) dot. Concretely, in [2] we divided this kernel into two stages, say dot \(_\mathrm{ini}\) and dot \(_\mathrm{fin}\), with the first one being implemented as a GPU kernel which performs the costly element-wise products and subsequent reduction within a thread block, producing a partial result in the form of a temporary vector with one entry per block. This is followed by routine dot \(_\mathrm{fin}\), which completes the operation by repeatedly reducing the contents of this vector into a single scalar via a sequence of calls to GPU kernels. The important aspect to note at this point is that, because the reduction proceeds within blocks, this initial stage of the reduction performs a mapped read of the input vectors, and therefore can be efficiently merged in the sequence \(K_1 \in \) {axpy, xpay, JPred, SpMV} \(\mathop {\rightarrow }\limits ^{y}\) dot \(_\mathrm{ini}\). Routine dot \(_\mathrm{fin}\) is in practice implemented as a sequence of GPU kernels with mapped/unmapped input/output; see [2]. As a consequence, this collection of kernels cannot itself be merged into a single one, and dot \(_\mathrm{fin}\) \(\mathop {\rightarrow }\limits ^{y}K_2\) cannot be fused.
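The following sketch illustrates the idea of fusing an axpy with dot \(_\mathrm{ini}\) (our own illustration, assuming a fixed block size of 256 threads): each block reduces the contributions of its threads in shared memory and emits one partial sum, which dot \(_\mathrm{fin}\) later collapses into a scalar in separate kernel launches.

```cuda
// axpy fused with the first stage of dot (dot_ini); assumes blockDim.x == 256.
// Here the reduced quantity is, e.g., dot(y, y) of the updated vector y.
#define BLOCK 256

__global__ void axpy_dot_ini(int n, float alpha, const float *x, float *y,
                             float *partial) {
  __shared__ float buf[BLOCK];
  int i = blockIdx.x * blockDim.x + threadIdx.x;

  float yi = 0.0f;
  if (i < n) {
    yi = y[i] + alpha * x[i];   // axpy: y := y + alpha*x
    y[i] = yi;                  // mapped write of the axpy result
  }
  buf[threadIdx.x] = yi * yi;   // contribution to dot(y, y); 0 beyond n

  __syncthreads();
  for (int s = BLOCK / 2; s > 0; s >>= 1) {   // block-level tree reduction
    if (threadIdx.x < s) buf[threadIdx.x] += buf[threadIdx.x + s];
    __syncthreads();
  }
  if (threadIdx.x == 0) partial[blockIdx.x] = buf[0];  // one value per block
}
```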
3.4 Fusions in BiCG
We next apply the previous fusion principles to the preconditioned BiCG with SpMV based on the scalar CSR or ELL format, and we summarize the results for the (2-D) vector CSR format and the non-preconditioned version.
The left-hand side graph in Fig. 2 identifies the dependencies (using arrows/edges) between operations of the preconditioned BiCG, with the nodes and their numeric labels identifying the operations within the loop body of the solver; see Fig. 1. (For simplicity, we do not include the operations before the loop body or the dependencies between different iterations.) As argued earlier, the dot operations (2, 9 and 14) are partitioned into two stages (a or b, corresponding respectively to kernel dot \(_\mathrm{ini}\) and routine dot \(_\mathrm{fin}\)) in order to facilitate the fusion of the first part, if possible, with a previous kernel. The node colors distinguish between the four different operation types: SpMV, dot, axpy/xpay and JPred. The patterns on top and bottom of each node specify, respectively, the type of mapping for the input and output vector(s) of each operation. Concretely, the parallel lines correspond to a mapped access and the chessboard pattern to an unmapped one. Operations 10 and 11 are special cases as they only receive/produce (input/output) one scalar and are merged into a single node.
The right-hand side graph in Fig. 2 illustrates one specific fusion of kernels among the several possibilities dictated by the kernel dependencies and the mappings of the input/output vectors. The fusions are encircled by thick lines and designate four macro-kernels: {1-2a}, {3-4-5-6-7-8-9a-14a}, {9b-10-11-14b}, {12-13}; plus a single-node (macro-)kernel: {2b}. The arrowless lines connect groups of independent kernels (e.g., 3 and 4). For simplicity, we do not include all the connections within a group. The arrows identify dependencies inside macro-kernels (e.g., from 4 to 5) and between them (e.g., from {1-2a} to {2b}).
Our fused version of the preconditioned BiCG, when SpMV employs the alternative vector CSR format (with unmapped input and output for SpMV), differs from that in Fig. 2 in that the two matrix-vector operations (kernels 1 and 6) are merged together; in addition, due to the unmapped output of kernel 1, kernel 2a becomes a single-node macro-kernel. The resulting macro-kernels are therefore: {1-6}, {2a}, {2b}, {3-4-5-7-8-9a-14a}, {9b-10-11-14b} and {12-13}. Also, for all variants of the BiCG solver (based on scalar CSR, vector CSR and ELL SpMV), the fusion graphs of their non-preconditioned counterparts simply differ in that kernels 5 and 8, corresponding to the application of the preconditioner, are not present.
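To see why the vector CSR kernel is characterized as unmapped, consider the following simplified warp-per-row sketch in the spirit of [6] (our own rendering, using modern warp shuffles rather than the original shared-memory reduction): the 32 threads of a warp cooperate on one row and only the first lane writes the result, so the thread-to-element mapping differs from that of the one-thread-per-element kernels and the result cannot be handed over through registers.

```cuda
// Simplified vector-CSR SpMV, y := A*x: one warp (32 threads) per row.
// Only lane 0 writes y[row], hence the unmapped output.
__global__ void spmv_csr_vector(int n, const int *rowptr, const int *colind,
                                const float *val, const float *x, float *y) {
  int warp = (blockIdx.x * blockDim.x + threadIdx.x) / 32;  // row index
  int lane = threadIdx.x % 32;
  if (warp < n) {
    float sum = 0.0f;
    for (int k = rowptr[warp] + lane; k < rowptr[warp + 1]; k += 32)
      sum += val[k] * x[colind[k]];
    for (int off = 16; off > 0; off >>= 1)          // intra-warp reduction
      sum += __shfl_down_sync(0xffffffff, sum, off);
    if (lane == 0) y[warp] = sum;                   // single writer per row
  }
}
```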
These particular fusions were chosen following the fusion principles presented in this section and some general performance guidelines:
-
The fusions can be decided by analyzing the kernels systematically, in increasing order of their labels (1, 2, etc.), taking into account the dependencies and the type of input/output access (mapped or unmapped). In general, the strategy is to reduce as much as possible the total number of macro-kernels, in order to avoid the associated performance and energy overheads. For the preconditioned BiCG, the right-hand side graph in Fig. 2 presents the minimum number of macro-kernels due to the restrictions imposed by the unmapped output vectors of the three dot operations (2a/b, 9a/b and 14a/b). We note that 10+11 could have been instead merged with {12-13}, but we selected the first option for performance reasons.
-
The dependencies between operations within the same macro-kernel specify a partial order for their execution. In principle, independent kernels are merged by integrating their instructions into a single code, one after another. As an exception, for performance reasons, when the initial or final stages of two independent dot operations are merged together into a single macro-kernel (e.g., 9a with 14a; and also 9b with 14b), their instructions are interleaved in the code; see the sketch after this list. (Interleaving of multiple dot operations was proposed in [4].)
-
Alternatively, 6 can be merged with {1-2a}, but this option was discarded because, for the scalar CSR and ELL implementations of SpMV, the result attained lower performance.
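The interleaving mentioned in the second guideline can be sketched as follows (our own illustration, assuming a block size of 256 threads): the initial stages of two independent dot products, e.g. 9a and 14a, share the loop structure of the block-level reduction instead of being executed back to back.

```cuda
// Two interleaved dot_ini stages in one macro-kernel; assumes blockDim.x == 256.
// Each block emits one partial sum per dot product; dot_fin finishes both.
__global__ void dot_ini_x2(int n, const float *x1, const float *y1,
                           const float *x2, const float *y2,
                           float *partial1, float *partial2) {
  __shared__ float buf1[256], buf2[256];
  int i = blockIdx.x * blockDim.x + threadIdx.x;

  buf1[threadIdx.x] = (i < n) ? x1[i] * y1[i] : 0.0f;
  buf2[threadIdx.x] = (i < n) ? x2[i] * y2[i] : 0.0f;
  __syncthreads();

  for (int s = 128; s > 0; s >>= 1) {   // interleaved tree reductions
    if (threadIdx.x < s) {
      buf1[threadIdx.x] += buf1[threadIdx.x + s];
      buf2[threadIdx.x] += buf2[threadIdx.x + s];
    }
    __syncthreads();
  }
  if (threadIdx.x == 0) {
    partial1[blockIdx.x] = buf1[0];
    partial2[blockIdx.x] = buf2[0];
  }
}
```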
3.5 Fusions in CG and BiCGStab
Figure 3 presents the fusion graphs for the preconditioned versions of CG and BiCGStab (see Footnote 1) when SpMV is based on the scalar CSR or ELL format. For the CG solver, the only difference when SpMV employs the vector CSR format is that kernels 1 and 2a become two separate single-node macro-kernels. The same applies to the two SpMV in BiCGStab, i.e. kernels 1 and 5, which become an isolated macro-kernel each. As in the BiCG solver, the non-preconditioned versions of CG and BiCGStab differ in that the nodes corresponding to the preconditioner application (5 for the former and 4, 13 in the latter) disappear.
The graphs in Fig. 3 contain the minimum number of macro-kernels. Due to the stricter dependencies of CG and BiCGStab compared with BiCG, the only alternative fusions in these two solvers are to instead join 7+8 with 9 in CG, and 10+11 with {12-13} in BiCGStab.
In summary, the study of this collection of cases (three solvers, with and without preconditioner, and three different implementations of SpMV) exposes that, for the type of operations involved in these iterative solvers, the two stages of the dot operations act as barriers (or synchronization points), enforcing a particular fusion/division of the macro-kernels.
4 Experimental Evaluation
In this section we evaluate the performance and energy gains of the merged solvers, comparing them with their non-fused counterparts. For this purpose, we employ several sparse matrices from the University of Florida Matrix Collection (UFMC) (see Footnote 2) and a finite-difference discretization of the 3D Laplace problem; see Table 2. The coefficient matrix A for audikw_1 and inline_1 is too large to be stored in the ELL format and these combinations of matrix case/storage format are excluded from the evaluation. Moreover, A is unsymmetric for fem_3dth2 and, therefore, cannot be tackled via the CG solver. For all cases, the solution vector was chosen to have all entries equal to 1, and the vector of independent terms was set to \(b=Ax\). The iterative solvers were initialized with the starting guess \(x_0=0\). All experiments were done using IEEE single precision (SP) arithmetic. While the use of double precision (DP) arithmetic is in general mandatory for the solution of sparse linear systems, the use of mixed SP-DP in combination with iterative refinement leads to improved execution time and energy consumption when the target platform is a GPU accelerator [5].
The target architecture is a Linux server (CentOS release 6.2 with kernel 2.6.32) equipped with a single Intel Core i7-3770K CPU (3.5 GHz, four cores) and 16 Gbytes of DDR3 RAM, connected via a PCI-e 2.0 bus to an NVIDIA “Kepler” K20c GPU (compute capability 3.5, 706 MHz, 2,496 CUDA cores) with 5 GB of GDDR5 RAM integrated into the accelerator board. Power was collected using a National Instruments (NI) Data Acquisition System, composed of the NI9205 module and the NIcDAQ-9178 chassis, and plugged into the lines that connect the output of the power supply unit with the motherboard and the GPU.
In total, we evaluated CG, BiCG and BiCGStab, with and without preconditioning, using three different implementations of SpMV (scalar CSR, vector CSR and ELL), and five different versions of each solver:
-
cublasL is a plain version of the solver implemented via calls to CUBLAS kernels from the legacy programming interface of this library, combined with ad-hoc implementations of SpMV. In this version, one or more scalars may be transferred between the main memory and the GPU memory address space each time a kernel is invoked and/or its execution is completed.
-
cublasN is an evolved version of the previous implementation that, whenever possible, maintains the scalars in the GPU memory (via the new interface of CUBLAS), in order to avoid unnecessary communication/synchronization between CPU and GPU; see the sketch after this list.
-
cuda replaces the CUBLAS (vector) kernels in the previous version by our ad-hoc implementations, including the two-stage dot.
-
merge applies the fusions described in Sect. 3.
-
merge_10 applies the fusions as well and, in addition, only checks the convergence every 10 iterations of the solver, thus reducing the number of synchronizations between CPU and GPU due to the evaluation of this test.
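To make the difference between cublasL and cublasN concrete, the following sketch expresses the idea in terms of the cuBLAS v2 API (an illustration of the mechanism, not our actual solver code): with the pointer mode set to device, the scalar produced by a dot product stays in GPU memory and can be consumed by the next call without a host-device transfer.

```cuda
#include <cublas_v2.h>

// With CUBLAS_POINTER_MODE_DEVICE, the dot result d_alpha remains on the GPU
// and is consumed in place by the subsequent axpy, avoiding the transfer and
// synchronization that the legacy (host-pointer) interface would incur.
void dot_then_axpy(cublasHandle_t handle, int n,
                   const float *d_x, float *d_y, float *d_alpha) {
  cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);
  cublasSdot(handle, n, d_x, 1, d_y, 1, d_alpha);   // d_alpha lives on the GPU
  cublasSaxpy(handle, n, d_alpha, d_x, 1, d_y, 1);  // uses it without a copy
}
```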
In summary, there are 3 solvers, 2 preconditioning modes, 3 implementations of SpMV, and 5 versions of each solver; i.e., 90 combinations. Furthermore, we execute these configurations under the polling and blocking CUDA synchronization modes, and evaluate them for 12 test matrices (11 for CG), collecting the time and energy per iteration for each scenario. In order to reduce the number of results to show, (i) we report the variations in time/energy of the different implementations with respect to cublasL executed in polling mode; (ii) in addition, we summarize the results for the matrix test cases into a single average value, giving the same weight to all matrix tests; and (iii) finally, we consider only the vector CSR implementation of SpMV for the UFMC cases and the ELL variant for the Laplace problems, since our experiments showed that these are the best options from the point of view of performance.
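For reference, the two synchronization modes can be selected through the CUDA runtime as sketched below (a generic illustration, not our benchmarking harness); the flag must be set before the CUDA context is created.

```cuda
#include <cuda_runtime.h>

// Blocking sync lets the host thread yield while waiting for the GPU, so the
// CPU can drop into a low-power C-state; spin (polling) busy-waits instead,
// reacting faster at a higher power cost.
void select_sync_mode(bool blocking) {
  cudaSetDeviceFlags(blocking ? cudaDeviceScheduleBlockingSync
                              : cudaDeviceScheduleSpin);
}
```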
With these considerations, Fig. 4 reports the time and energy variations for three solvers (CG, BiCG, BiCGStab) with/without preconditioning and five versions of each (cublasL, cublasN, cuda, merge, merge_10), executed under two different synchronization modes (polling and blocking).
The first aspect to note is that all plots in Fig. 4 reflect the same qualitative trend, independently of the specific solver and whether or not the preconditioner is present. Let us consider, e.g., the top-left plot (CG solver without preconditioner). Compared with the baseline case (cublasL executed in polling mode), the two non-fused versions cublasN and cuda only experience a slight increase in both time and energy (around 1 % and 2 %, resp.) when operating under the polling mode. For the alternative blocking mode, these versions present an appealing reduction of the energy consumption (above 9 %), but unfortunately this comes at the cost of a more visible performance penalty (a time increase of more than 6 %). The desired combination (reduction in both time and energy) is attained by the merged versions (merge and merge_10). Both versions report a decrease in execution time of more than 5 %, except for merge executed in blocking mode, for which the variation in time is negligible. The best combination is clearly merge_10, which combines this reduction of time with a remarkable decrease in energy consumption of more than 15 %.
In general, the best option is to employ merge_10 executed in blocking mode. Compared with the baseline case, the reduction in time for all solvers and preconditioning modes is between 5.1 % and 10.2 %, while from the energy perspective the savings vary between 4.0 % and 20.0 %. Comparing merge_10 with the same implementation executed in polling mode, the blocking mode basically matches its performance (around the same execution time) while producing higher energy gains, especially for CG and BiCGStab.
5 Concluding Remarks
We have introduced and applied a systematic methodology to derive fused versions of three popular iterative solvers (with and without preconditioning) for sparse linear systems. An analysis of the type of access that the threads in charge of a kernel’s execution perform on the kernel inputs and outputs, together with the observation of the data dependencies between kernels, determines the candidates to be fused. For performance and energy efficiency reasons, the general goal is to minimize the number of macro-kernels that results from the application of the fusions. From this point of view, we obtain reductions from 10\(\rightarrow \)5, 13\(\rightarrow \)5 and 14\(\rightarrow \)8 kernels for the preconditioned versions of CG, BiCG and BiCGStab, respectively. The gains are experimentally demonstrated on a recent CPU-GPU architecture, consisting of an Intel “Ivy Bridge” multicore processor and an NVIDIA “Kepler” GPU. Compared with plain versions of the solvers based on CUBLAS and ad-hoc implementations of SpMV, the fused versions attain remarkable energy savings when executed in blocking mode. Furthermore, in general they match the performance of the same versions when executed in the performance-oriented but power-hungrier polling mode.
Notes
- 1.
For BiCGStab, nodes 7 and 12 of the graph actually embed two dependent operations of type axpy/xpay each. For brevity, they are represented with a single node each.
- 2.
References
CSB library (2014), http://gauss.cs.ucsb.edu/aydin/csb/html/
Aliaga, J.I., Pérez, J., Quintana-Ortí, E.S., Anzt, H.: Reformulated conjugate gradient for the energy-aware solution of linear systems on GPUs. In: 42nd International Conference on Parallel Processing (ICPP), pp. 320–329 (2013)
Aliaga, J.I., et al.: Unveiling the performance-energy trade-off in iterative linear system solvers for multithreaded processors. Concurrency and Computation: Practice and Experience (2014, to appear)
Anzt, H., Sawyer, W., Tomov, S., Luszczek, P., Yamazaki, I., Dongarra, J.: Optimizing Krylov subspace solvers on graphics processing units. In: IEEE International Parallel Distributed Processing Symposium Workshops (IPDPSW), pp. 941–949 (2014)
Anzt, H., et al.: Analysis and optimization of power consumption in the iterative solution of sparse linear systems on multi-core and many-core platforms. In: International Green Computing Conference Workshops (IGCC), pp. 1–6 (2011)
Bell, N., Garland, M.: Efficient sparse matrix-vector multiplication on CUDA. NVIDIA Technical report NVR-2008-004, NVIDIA Corp., December 2008
Buluç, A., Williams, S., Oliker, L., Demmel, J.: Reduced-bandwidth multithreaded algorithms for sparse matrix-vector multiplication. In: IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 721–733 (2011)
Choi, J.W., Singh, A., Vuduc, R.W.: Model-driven autotuning of sparse matrix-vector multiply on GPUs. In: ACM SIGPLAN Symposium Principles and Practice of Parallel Programming (PPoPP), vol. 45, pp. 115–126 (2010)
Duranton, M., et al.: HiPEAC vision 2015. High performance and embedded architecture and compilation (2015). http://www.hipeac.net/vision
Filipovic, J., Madzin, M., Fousek, J., Matyska, L.: Optimizing CUDA code by kernel fusion–application on BLAS. Computing Research Repository (CoRR) abs/1305.1183 (2013). http://arxiv.org/abs/1305.1183
Fuller, S.H., Millett, L.I.: The Future of Computing Performance: Game Over or Next Level? National Research Council of the National Academies (2011)
Saad, Y.: Iterative Methods for Sparse Linear Systems. SIAM, Philadelphia (2003)
Tabik, S., Ortega, G., Garzón, E.: Performance evaluation of kernel fusion BLAS routines on the GPU: iterative solvers as case study. J. Supercomputing 70(2), 577–587 (2014)
Wang, G., Lin, Y., Yi, W.: Kernel fusion: An effective method for better power efficiency on multithreaded GPU. In: Green Computing and Communications (GreenCom), pp. 344–350 (2010)
Williams, S., Bell, N., Choi, J., Garland, M., Oliker, L., Vuduc, R.: Sparse matrix vector multiplication on multicore and accelerator systems. In: Kurzak, J., Bader, D.A., Dongarra, J. (eds.) Scientific Computing with Multicore Processors and Accelerators. CRC Press (2010)
Acknowledgements
This research was supported by projects EU FP7 318793 (Exa2Green) and TIN2011-23283 of the Ministerio de Economía y Competitividad and EU FEDER. We thank Hartwig Anzt from the University of Tennessee for his comments.