1 Introduction

The solution of large sparse linear systems is central to many problems in science, engineering, and optimization. Direct methods, for which the major computational task is factoring the coefficient matrix, are often the solution method of choice due to their generality and robustness. Performance optimization, and particularly parallelization, of sparse factorization has been the subject of intensive research.

A growing portion of the computational capability of supercomputer clusters is now provided by accelerators, particularly GPUs [22]. The characteristics of these distributed heterogeneous systems, most notably the separate CPU and GPU memories and the limited bandwidth of the PCIe bus over which they communicate, make leveraging their computational power a challenge for irregular algorithms such as sparse factorization. As a result, there is a need to adapt the implementation of sparse factorization algorithms to such heterogeneous architectures, and there is active work in this area [6, 14, 19, 21, 23].

Previous work on GPU acceleration of sparse factorization has taken the approach of offloading dense math to the GPU while making accommodations for:

  • only sending appropriately sized work to the GPU,

  • asynchronous operations to overlap CPU computation, GPU computation and PCIe communication,

  • minimizing communication by re-using data on the GPU as much as possible.

Most previous work has been limited to single-node, shared-memory systems, with the notable exception of [20], which showed performance on 8 nodes.

This work seeks GPU acceleration of sparse factorization on large heterogeneous, distributed systems. The current scope is limited to Cholesky factorization (\(A = LL^T\)) [8] as that is the simpler case, but we expect many of the techniques to be applicable to other factorization algorithms as well. While we leverage the techniques itemized immediately above, this implementation is novel in that it:

  • accelerates the factorization in the Watson Sparse Matrix Package (WSMP) [9], which shows leading performance on distributed systems [12, 13, 18],

  • uses a minimally-invasive approach promoting maintainability and portability of the underlying CPU code,

  • demonstrates improved performance that scales to an order of magnitude more nodes and cores than previous work, and

  • identifies techniques for achieving further performance improvements.

We describe the implementation in detail, present experimental results, analyze the results to gain insights about performance bottlenecks, and suggest concrete and feasible avenues for further performance improvement. We show that the use of GPUs can more than double the performance of WSMP’s sparse Cholesky factorization on up to 128 nodes of NCSA’s Blue Waters supercomputer [3] and can strong scale beneficially up to 512 nodes.

2 WSMP Cholesky Factorization

WSMP uses a highly scalable distributed-memory parallel sparse Cholesky factorization algorithm [11] based on the multifrontal method [5, 16]. A multifrontal algorithm expresses the entire sparse matrix factorization in terms of partial factorizations of smaller dense matrices called frontal matrices of the type illustrated in Fig. 1. The rows and columns of the sparse coefficient matrix A are divided into contiguous blocks called supernodes [2] to aid the construction of the dense frontal matrices. The control and data flow in a typical multifrontal algorithm follows a dependency graph known as the elimination tree [15], which can be computed inexpensively from the structure of a symmetric sparse matrix. Each vertex of the elimination tree corresponds to a supernode.

Fig. 1. Computation in the frontal matrix \(F^i\) of a typical supernode in sparse multifrontal Cholesky factorization.

In WSMP’s distributed-memory parallel factorization utilizing p MPI processes, the supernodal tree is binary in the top \(\log _2 p\) levels. The portions of this binary supernodal tree are assigned to the processes using a subtree-to-subcube strategy [17] illustrated in the top two levels of the tree in Fig. 2. The frontal matrix of each supernode is distributed among a logical mesh of a subset of processes using a bit-mask based block-cyclic scheme [11]. The parallel partial factorization operation at each supernode is a pipelined implementation of dense block Cholesky factorization and update operations shown in Fig. 1. Most of the flops corresponding to a supernode are performed by Basic Linear Algebra Subprograms (BLAS) [4]. The three key operations are: (1) dense Cholesky on the top left matrix S, (2) the DTRSM operation \(T = S^{-1}T\), and (3) the DSYRK operation \(U^i = U^i - TT^T\). Hereafter, we will refer to a factorization task like this performed on multiple nodes as cooperative factorization.
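
For concreteness, the three key operations above can be sketched for a single frontal matrix as follows. This is an illustrative shared-memory sketch using the C interfaces to LAPACK and BLAS, not WSMP's distributed, pipelined implementation; the storage layout, the orientation of T, and the routine name are our assumptions.

#include <cblas.h>
#include <lapacke.h>

/* Illustrative partial factorization of one frontal matrix F^i (Fig. 1).
 * Column-major storage is assumed, with S (k x k) the fully summed block,
 * T (m x k) the panel below it, and U (m x m) the update (contribution)
 * block that is later assembled into the parent's frontal matrix. */
static void partial_factor(int k, int m,
                           double *S, int ldS,
                           double *T, int ldT,
                           double *U, int ldU)
{
    /* (1) Dense Cholesky of the fully summed block: S = L * L^T */
    LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', k, S, ldS);

    /* (2) The DTRSM step: solve the panel against the Cholesky factor of S */
    cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans,
                CblasNonUnit, m, k, 1.0, S, ldS, T, ldT);

    /* (3) The DSYRK step: U <- U - T * T^T */
    cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans,
                m, k, -1.0, T, ldT, 1.0, U, ldU);
}

In WSMP, these operations are not issued monolithically as above but are pipelined in blocks across the process mesh that holds the frontal matrix.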

Usually, each MPI process is multithreaded, and the portion of the multifrontal factorization assigned to each process is further parallelized [10]. Just as in the message-passing portion, tasks at each subroot of the elimination tree are assigned to independent groups of threads until each thread ends up with its own subtree. However, the mapping between tasks and threads is more flexible than the mapping between tasks and MPI processes. Furthermore, a strict block-cyclic mapping based on the binary representation of the indices is not used, because all processors can access all rows and columns with relatively small overhead. Hereafter, we will refer to factorization tasks performed on a single MPI rank as individual factorization.

Fig. 2. A mapping of a hypothetical elimination tree on a 4-node cluster with 4-way shared-memory nodes.

Figure 2 shows one hypothetical mapping of an elimination tree among the 16 CPUs of a 4-node cluster with 4-way shared-memory parallel nodes. The symbolic computation preceding the numerical phase ensures that the tree is binary in the top \(\log _2 p\) levels. At each of these levels, multithreaded BLAS calls are used to utilize all the CPUs on each node during cooperative factorization. In the individual factorization region of the tree, which lies below the top \(\log _2 p\) levels, the subtrees are assigned to groups of threads until they are mapped onto single threads.

3 Minimally Invasive Approach

When factorization involves sufficiently large dense blocks, such that the computation can be tiled and the communication costs hidden behind computation, satisfactory acceleration can be achieved by offloading large level-3 BLAS computations to the GPU [21]. However, when dense blocks are fewer or smaller (as might occur in shell models), or at scale (when the dense blocks are distributed across many nodes), communication cannot be hidden and the opportunity for easy GPU acceleration is limited. Further, for best PCIe performance, and to support transfers that are asynchronous or concurrent with GPU computation, the data on the host must reside in ‘pinned’ (page-locked) memory. Copying data into or out of pinned memory buffers adds to the communication overhead. This BLAS-offloading approach was investigated for use with WSMP and, as expected, did not yield significant benefit.

However, as WSMP is inherently multithreaded and performs multiple BLAS operations simultaneously in different threads, the same idea behind tiling a single large BLAS operation can be used to hide communication among many simultaneous small BLAS operations. That is, computation for a BLAS operation in one thread can proceed simultaneously with PCIe communication for a BLAS operation in another thread and with the host memory copies (into or out of pinned buffers) of a third. This parallel offload approach did not involve any changes to the base WSMP code and permitted GPU acceleration of the numerical factorization by a factor of 1.7 on average [7] on a single multicore node. Unfortunately, at scale, to promote load balancing, WSMP decomposes BLAS calls into smaller blocks than can be accelerated in this fashion.
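
The following sketch illustrates this overlap mechanism, assuming each host thread owns its own CUDA stream, cuBLAS handle, pinned staging buffer, and device workspace; the structure and function names are illustrative, not the actual WSMP code.

#include <cstring>
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Per-thread offload context: while one thread's DGEMM executes on the GPU,
// another thread's PCIe transfer and a third thread's pinned-buffer packing
// can proceed concurrently.
struct OffloadCtx {
    cudaStream_t   stream;
    cublasHandle_t handle;
    double        *h_pinned;   // page-locked host staging buffer (cudaMallocHost)
    double        *d_buf;      // device workspace (cudaMalloc)
};

// C <- C - A * B^T, offloaded through this thread's stream.
void offload_dgemm(OffloadCtx &ctx, int m, int n, int k,
                   const double *A, const double *B, double *C)
{
    size_t szA = (size_t)m * k, szB = (size_t)k * n, szC = (size_t)m * n;
    double *hA = ctx.h_pinned, *hB = hA + szA, *hC = hB + szB;
    double *dA = ctx.d_buf,    *dB = dA + szA, *dC = dB + szB;
    const double one = 1.0, minus_one = -1.0;

    // Pack operands into the pinned buffer (asynchronous PCIe copies require
    // page-locked memory), then issue the transfer on this thread's stream.
    memcpy(hA, A, szA * sizeof(double));
    memcpy(hB, B, szB * sizeof(double));
    memcpy(hC, C, szC * sizeof(double));
    cudaMemcpyAsync(dA, hA, (szA + szB + szC) * sizeof(double),
                    cudaMemcpyHostToDevice, ctx.stream);

    cublasSetStream(ctx.handle, ctx.stream);
    cublasDgemm(ctx.handle, CUBLAS_OP_N, CUBLAS_OP_T, m, n, k,
                &minus_one, dA, m, dB, n, &one, dC, m);

    cudaMemcpyAsync(hC, dC, szC * sizeof(double), cudaMemcpyDeviceToHost, ctx.stream);
    cudaStreamSynchronize(ctx.stream);
    memcpy(C, hC, szC * sizeof(double));
}

With one such context per thread, the copies issued by one thread overlap with the kernels already queued on other threads' streams, which is what hides the PCIe traffic.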

3.1 Dual Library Solution

Pursuing further optimization, it is clear from Fig. 1 that DTRSM is typically followed by DSYRK, which takes as input the output of DTRSM. As a result, a considerable amount of unnecessary, but expensive, communication remains in transferring the submatrix T into and out of GPU memory. It is relatively simple to construct a library routine that performs both DTRSM and DSYRK on the GPU and eliminates the intervening communication. However, such an optimization would require high-level code changes to WSMP. Consequently, a scheme was devised that would not only help overcome the performance constraints at scale, but would also prove to be an elegant way of enabling a large, complex existing software package like WSMP to effectively harness the power of accelerators. This was implemented as a library (the ‘dual’ library, which we will call ACCEL_WSMP) containing all of the GPU-specific code, accessed via high-level (BLAS or above) routines. Only small changes would need to be made to WSMP to call these routines, if available, at a high level. In this way, WSMP could be linked with a driver code and, if ACCEL_WSMP were not provided, its CPU-only behavior would be unchanged. However, if ACCEL_WSMP were provided during linking, then GPU acceleration would be obtained. In this sense, we consider the GPU acceleration to be ‘minimally invasive’.
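
One way to realize this kind of optional linking, sketched below for ELF platforms, is to give WSMP a weak default definition of each high-level accelerator routine that simply invokes the existing CPU path; a strong definition of the same symbol in the accelerator library, when linked, overrides it. The symbol name and signature here are hypothetical, not the actual WSMP/ACCEL_WSMP interface.

#include <stdio.h>

/* Pre-existing CPU code path inside WSMP (stubbed for illustration). */
static void cpu_dtrsm_then_dsyrk(int k, int m)
{
    printf("CPU-only path: k=%d, m=%d\n", k, m);
}

/* WSMP side: a weak default keeps the CPU-only behavior unchanged.  If an
 * accelerator library providing a strong definition of accel_dtrsmsyrk()
 * is present at link time, that definition wins and the call is routed
 * to the GPU instead. */
#pragma weak accel_dtrsmsyrk
void accel_dtrsmsyrk(int k, int m)
{
    cpu_dtrsm_then_dsyrk(k, m);
}

int main(void)
{
    /* The call site in WSMP is identical whether or not the accelerator
     * library was linked. */
    accel_dtrsmsyrk(1024, 3072);
    return 0;
}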

In addition to the acceleration techniques described so far, additional optimizations were to:

  1. limit the minimum tile size into which the problem can be decomposed to a size that still permits acceleration on the GPU,

  2. create a high-level API providing parallel, tiled GPU BLAS operations and combined GPU BLAS operations to minimize communication, and

  3. define a protocol for flagging matrices on the GPU for re-use (and later indicating when they can be deleted) to further minimize communication.

ACCEL_WSMP consists of a total of 10 functions for performing computations on dense matrices and/or moving them between host and device memories, and two administrative functions, including one-time initialization that performs the relatively expensive task of allocating pinned buffers. Note that all computational routines in ACCEL_WSMP have a CPU equivalent (pre-existing in WSMP). If the GPU is occupied (by other WSMP threads) or the operation is too small, then the computation remains on the CPU.
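
The individual routines of ACCEL_WSMP are not listed here; the header below is a purely hypothetical illustration of what such an interface could look like, reflecting the three optimizations above. All names, signatures, and argument lists are our assumptions, not the actual API.

/* accel_wsmp_sketch.h -- HYPOTHETICAL interface in the spirit of ACCEL_WSMP;
 * the actual routine names and signatures are not published here and these
 * are purely illustrative. */
#ifndef ACCEL_WSMP_SKETCH_H
#define ACCEL_WSMP_SKETCH_H

#include <stddef.h>

/* Administrative routines: one-time initialization (allocates the pinned
 * host buffers, CUDA streams, and cuBLAS handles) and final cleanup. */
int  accel_init(size_t pinned_bytes_per_thread, int n_threads);
void accel_finalize(void);

/* Combined, tiled GPU BLAS operations.  Each routine falls back to the
 * pre-existing CPU code when the GPU is occupied by another thread or the
 * operation is below the profitability cutoff (about k = 384 in Fig. 3). */
void accel_dtrsmsyrk(int k, int m, const double *S, int ldS,
                     double *T, int ldT, double *U, int ldU);
void accel_dgemm(int m, int n, int k, const double *A, int ldA,
                 const double *B, int ldB, double *C, int ldC);

/* Re-use protocol: WSMP flags a matrix that will be read by many subsequent
 * DGEMM/DSYRK calls; ACCEL_WSMP tries to keep it resident in GPU memory
 * until it is explicitly released. */
int  accel_flag_for_reuse(const double *host_matrix, size_t bytes);
void accel_release(int reuse_handle);

#endif /* ACCEL_WSMP_SKETCH_H */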

3.2 Individual Factorization

In the initial individual factorization phase of WSMP’s sparse Cholesky factorization, the computationally intensive portion consists of a DTRSM followed by a symmetric rank-k update (DSYRK), with the DTRSM output used as input to DSYRK.

Fig. 3. Performance of DTRSMSYRK with and without caching of DTRSM output on the GPU.

In Fig. 3, we show the benefit of caching the output matrix T from DTRSM for subsequent use in DSYRK. The figure shows the performance of the DTRSMSYRK routine on a single node of our target cluster (see Sect. 4) as a function of the dimension k of the triangular matrix S. The number of rows of matrix T is set to an empirically derived value of 3k. For comparison, the performance achieved when the DTRSM output is not cached in GPU memory (and is therefore reloaded for DSYRK) is also shown. Caching improves DTRSMSYRK performance by about 30 % over individual uncached DTRSM and DSYRK. GPU performance increases with k, with a slight decrease at \(k = 2048\) due to a switch to the tiled version of the algorithm. GPU performance overtakes the highest possible CPU performance at about \(k=384\), which is the cutoff that we use to determine whether the DTRSMSYRK operation is performed on the CPU or on the GPU.
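
A minimal cuBLAS sketch of the cached DTRSMSYRK kernel is shown below; the device buffers are assumed to have been populated by asynchronous uploads, the \(k \ge 384\) cutoff and tiling for large fronts are handled outside this fragment, and all names are illustrative.

#include <cublas_v2.h>

/* DTRSM followed by DSYRK with the panel T kept resident on the GPU between
 * the two calls, so that T crosses the PCIe bus once instead of three times.
 * dS is the k x k Cholesky factor of S, dT is the m x k panel, and dU is the
 * m x m update block; all three already reside in device memory. */
void dtrsmsyrk_cached(cublasHandle_t handle, int k, int m,
                      const double *dS, double *dT, double *dU)
{
    const double one = 1.0, minus_one = -1.0;

    /* The DTRSM step of Fig. 1, applied to the device-resident panel. */
    cublasDtrsm(handle, CUBLAS_SIDE_RIGHT, CUBLAS_FILL_MODE_LOWER,
                CUBLAS_OP_T, CUBLAS_DIAG_NON_UNIT,
                m, k, &one, dS, k, dT, m);

    /* The DSYRK step, U <- U - T * T^T, reusing dT without re-uploading it. */
    cublasDsyrk(handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N,
                m, k, &minus_one, dT, m, &one, dU, m);
}

The uncached variant would instead download T after the DTRSM and upload it again for the DSYRK, which accounts for the roughly 30 % difference visible in Fig. 3.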

3.3 Cooperative Factorization

In the cooperative phase of WSMP’s Cholesky factorization, the dense Cholesky panel factorization illustrated in Fig. 1 is performed with S, T, and U distributed among multiple nodes. The dense level-3 BLAS calls in this phase are DTRSM, DSYRK and matrix-matrix multiply (DGEMM). Generally, the caching strategy described in Sect. 3.2 is used to hide PCIe communication.

There are two additional challenges in this phase. First, unlike in DTRSMSYRK, where the DTRSM output could be reused immediately as the DSYRK input, during cooperative factorization this reuse is spread over multiple DGEMM calls. To expose this data reuse, WSMP signals when a particular matrix is likely to be used across many DGEMM and DSYRK calls. Such flagging of reusable data is a key part of the interface between the WSMP and ACCEL_WSMP libraries. ACCEL_WSMP attempts to retain flagged matrices until WSMP signals that a previously flagged matrix is no longer needed.
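
Building on the hypothetical interface sketched in Sect. 3.1, the reuse pattern during cooperative factorization would look roughly as follows; this is illustrative only, and the blocking and helper arrays are placeholders.

#include "accel_wsmp_sketch.h"

/* Sketch of the re-use protocol: the panel T is flagged once, read by a
 * whole pipeline of DGEMM updates without being re-uploaded, and released
 * when WSMP signals that it is no longer needed. */
void update_pipeline(int m, int k, double *T,
                     int n_blocks, double **B, double **U, const int *nb)
{
    int hT = accel_flag_for_reuse(T, (size_t)m * k * sizeof(double));

    for (int b = 0; b < n_blocks; ++b)
        /* U_b <- U_b - T * B_b^T; ACCEL_WSMP keeps T resident on the GPU. */
        accel_dgemm(m, nb[b], k, T, m, B[b], nb[b], U[b], m);

    accel_release(hT);   /* the device copy of T may now be evicted */
}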

The second challenge is related to load balancing. In the cooperative factorization phase, WSMP distributes frontal matrices in a block-cyclic fashion among the participating nodes in order to facilitate pipelined, load-balanced dense panel factorization. The optimal block size needs to be large enough to take advantage of level-3 BLAS operations, yet small enough not to cause excessive load imbalance in the pipeline. Typical block sizes used on the CPU are 64 or 128. This block size is the inner dimension k of the DSYRK and DGEMM calls issued during cooperative factorization. When GPU acceleration is used, since data transfer must be hidden behind computation, the achieved performance is directly proportional to the inner dimension. As shown in Fig. 3, GPU performance lags CPU performance for inner dimensions smaller than about 384. Therefore, a block size of 512 is used in order to achieve significant GPU acceleration, which increases load imbalance and negatively impacts scalability. The trade-offs involved in the selection of the block size are discussed in more detail in Sects. 5 and 6.

4 Testing Configuration

The matrices used in this study are extracted via a DMAP procedure [1] from the commercial FEA software NX Nastran (2012) from Siemens PLM. They represent industrial CAD geometries and are automatically meshed with 10-noded tetrahedral elements. Figure 4 shows the geometries used for this paper: a symmetric machine-part cutter with asymmetrical loads, and a header part of a Charge Air Cooler (CAC) with complex geometry, whose elements have higher aspect ratios and therefore a higher condition number. For the cutter model, element size control was used to produce different levels of automatic mesh refinement. Table 1 summarizes the test matrices.

Fig. 4. Finite element discretization of the cutter model (left) used for matrices M2, M6, and M20, and the Charge Air Cooler (CAC) model used for M11 (right).

Table 1. Test matrices

The sustained petascale Blue Waters [3] system, hosted at the University of Illinois National Center for Supercomputing Applications (NCSA), was used in our experiments. We used the accelerated XK7 nodes of Blue Waters, each containing a single AMD Interlagos processor with 8 floating-point cores (32 GB of node memory) and a single NVIDIA Kepler K20X GPU (6 GB of GPU memory). The Intel compiler version 2015 was used for compilation, and sequential Intel MKL for BLAS routines on the host. The CUDA runtime and cuBLAS libraries from the NVIDIA CUDA Toolkit version 5.5 were used for operations on the GPU. All runs used a single MPI rank per node and utilized all 8 cores on each node by using 8 threads per MPI rank.

Table 2. Numerical factorization times (seconds) of the default CPU code, CPU code with block size of 512, and GPU-enabled code with block size of 512.

5 Scaling and Acceleration Results

The numerical factorization performance for each of the four matrices has been benchmarked against the number of XK7 nodes used, spanning from the minimum number of nodes necessary to accommodate the problem to where the GPU-accelerated performance drops below the CPU-only performance. Parallel speed-up is defined as the ratio of the wall clock time on one cluster node to the wall clock time on p nodes. However, since most matrices do not fit on one node, the speed-up is computed with respect to the performance on the minimum number of nodes that a problem fits on.

Figure 5 shows the parallel speed-up obtained both on the CPU and the GPU for the numerical factorization of the four test matrices. Solid blue and green lines show the CPU and GPU results, respectively. Plain dashed lines of the corresponding colors show ideal speed-up from the starting point of each curve. The actual Cholesky factorization times on which these speed-up curves are based are shown in Table 2.

Fig. 5. Numerical factorization speed-ups vs. number of XK7 nodes for the four matrices studied. (Color figure online)

In Sect. 3.3, it was noted that a block size of 512 was used in the cooperative factorization phase of the GPU runs. In CPU-only runs, WSMP automatically chooses the block size, which is typically 64 or 128. To estimate the effect of the increased block size on load imbalance, and hence on performance, the CPU runs were also performed using a block size of 512; the results are shown as black lines in Fig. 5. In all cases, the performance penalty due to the larger block size grows as the number of nodes increases. This happens because, in wider runs, more levels of the elimination tree are cooperatively factored, and the number of XK7 nodes among which frontal matrices of a given size are distributed increases. This creates a longer pipeline of dense operations on smaller submatrices for a given frontal matrix size. A large block size results in fewer stages in the pipeline, and therefore greater load imbalance. Figure 5 also shows the projected performance (dotted red line) that could be achieved through the further optimizations discussed in Sect. 6.

In all four cases, at smaller node counts, the GPU-accelerated version shows significant performance gains (2.5–3.5x) over the CPU-only case. These gains become smaller for wider runs.

Figure 6 shows the acceleration achieved by the GPU-enabled library over the CPU-only version for the cutter and CAC matrices at different node counts. This plot directly demonstrates the benefit of GPU acceleration.

Fig. 6. Acceleration achieved by the GPU-enabled library over the CPU-only factorization vs. number of XK7 nodes.

6 Performance Analysis

The results in Sect. 5 show significant GPU acceleration with respect to the CPU-only performance for complex problems on hundreds of nodes. In this section, we study the issues limiting further GPU acceleration and show that one of the key causes can be remedied to a large extent by a simple enhancement to our implementation.

We have already touched upon the need for a large block size of 512 in the GPU implementation and the resulting load imbalance in cooperative factorization. Splitting the submatrix T into column blocks results in the same blocks of U being updated multiple times. Therefore, if ACCEL_WSMP can retain the result matrix of DGEMM in GPU memory, then the result matrix, U, need be transferred back to the CPU only once, after the last DGEMM. In this way, the total amount of PCIe traffic would become independent of the block size; in fact, it would be the same as in the unblocked case, permitting effective use of smaller block sizes and improving both performance and load balancing. The current implementation of ACCEL_WSMP includes the capability to cache the input matrices of DGEMM, but not the output.
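
A sketch of this proposed enhancement is shown below: U stays resident on the device while the column blocks of T stream in, and it is downloaded only once after the last update. This is not the current ACCEL_WSMP behavior; it is shown for the symmetric (DSYRK) case for brevity, the DGEMM case being analogous, and the function and buffer names are illustrative.

#include <cublas_v2.h>
#include <cuda_runtime.h>

/* Proposed output caching: U <- U - sum_b T_b * T_b^T, with U held in GPU
 * memory across all column blocks T_b of width kb.  PCIe traffic for U then
 * no longer depends on the number of column blocks, i.e. on the block size. */
void syrk_with_cached_U(cublasHandle_t handle, cudaStream_t stream,
                        int m, int kb, int n_blocks,
                        double *const *hT_blocks, double *hU,
                        double *dT, double *dU)
{
    const double one = 1.0, minus_one = -1.0;
    cublasSetStream(handle, stream);

    /* Upload U once.  (hU and hT_blocks are assumed to be pinned buffers.) */
    cudaMemcpyAsync(dU, hU, sizeof(double) * m * m,
                    cudaMemcpyHostToDevice, stream);

    for (int b = 0; b < n_blocks; ++b) {
        /* Stream in one column block of T and apply its update; double-
         * buffering dT would additionally overlap these uploads with the
         * DSYRK kernels. */
        cudaMemcpyAsync(dT, hT_blocks[b], sizeof(double) * m * kb,
                        cudaMemcpyHostToDevice, stream);
        cublasDsyrk(handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N,
                    m, kb, &minus_one, dT, m, &one, dU, m);
    }

    /* Single download of the fully updated U, after the last update. */
    cudaMemcpyAsync(hU, dU, sizeof(double) * m * m,
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);
}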

To estimate the effect of caching U, we first calculated the time lost on the CPU due to the added load imbalance when running with a block size of 512. This is the difference between the CPU-512 and CPU-default times (Table 2). We assume that this overhead is also reflected in the GPU-512 time, but scaled by a factor related to the GPU acceleration of the imbalanced computation. We used the ratio of the CPU-512 to GPU-512 time on the smallest number of nodes that a matrix ran on to derive a conservative estimate of this acceleration factor. The time lost to load imbalance in the CPU-512 case is then divided by this factor and subtracted from the GPU-512 time to obtain the GPU-projected time and speed-up. The result is shown by the dotted red lines (GPU-projected) in Fig. 5.
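
Expressed as a formula, with the times taken from the corresponding columns of Table 2 and the acceleration factor \(\alpha\) measured at the smallest node count on which the matrix runs, the projection is
\[
t_{\mathrm{GPU\text{-}projected}} \;=\; t_{\mathrm{GPU\text{-}512}} \;-\; \frac{t_{\mathrm{CPU\text{-}512}} - t_{\mathrm{CPU\text{-}default}}}{\alpha},
\qquad
\alpha \;=\; \left.\frac{t_{\mathrm{CPU\text{-}512}}}{t_{\mathrm{GPU\text{-}512}}}\right|_{\text{smallest node count}} .
\]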

7 Conclusions

Sparse factorization is difficult to accelerate on heterogeneous clusters due to the irregularity of its computation and memory access patterns and the limited bandwidth of communication between the GPU and the CPU. These issues come in addition to the traditional cluster hurdles of problem decomposition and MPI communication.

In this work, it has been shown that with remarkably minor modifications to the original CPU code, the numerical factorization performance of WSMP can be accelerated on a large number of GPU-enabled nodes of a supercomputer. Our approach relied upon identifying computations that could be beneficially accelerated, defining an accelerator API at an appropriate level of abstraction to capture these computations, and designing a separate library to implement the API. Benchmarking on industrially derived matrices shows that on 128 nodes, the GPU-accelerated code is about 2 times faster than the CPU-only code on the same number of XK7 nodes. A higher degree of acceleration, up to 3.5x, is observed at lower node counts. Due to load-balancing issues and the difficulty of maintaining sufficient computational intensity to hide CPU\(\leftrightarrow \)GPU transfers, the observed speed-up is reduced at larger node counts. Caching U on the GPU has been identified as a way to recover a significant portion of the performance lost on large numbers of nodes.

The observed scaling represents a unique demonstration that GPUs can be effectively applied to sparse factorization on large clusters and has the potential to permit application of these GPU-accelerated clusters to new types of analysis.

A distinctive feature of our approach is that performance portability across different accelerator platforms can be achieved by simply tuning or reimplementing the well-contained accelerator library.