LU factorization on heterogeneous systems: an energy-efficient approach towards high performance


Abstract

Dense lower–upper (LU) factorization (hereafter referred to as LU) is a critical kernel that is widely used to solve dense linear algebra problems. Hybrid LU algorithms have been carefully designed to exploit the full capacity of heterogeneous systems. However, existing heterogeneous implementations are typically CPU-centric: they rely heavily on the CPU cores and incur a large volume of data transfers over the PCIe bus, which reduces the overall energy efficiency of the system. In this paper, we present a coprocessor-resident implementation of LU for a heterogeneous platform that improves energy efficiency by relieving the CPUs of heavy computational load and avoiding excessive data transfers over PCIe. To maintain performance, we optimize the implementation to pipeline CPU computation, coprocessor computation, MPI communication, and PCIe transfers between the CPUs and coprocessors. Experiments on the Tianhe-2 supercomputer show that our LU implementation competes with the highly optimized Intel MKL implementation in performance while overcoming its limitations in energy efficiency.
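The kernel structure underlying this work is the standard right-looking blocked LU: factor a panel, apply the row interchanges, solve a triangular system for the block row of U, then update the trailing matrix with a large GEMM. The sketch below shows that baseline structure on a single node using generic CBLAS/LAPACKE calls. It is an illustrative reconstruction, not the paper's code: it omits the coprocessor offload, MPI broadcasts, and the PCIe/communication overlap that the paper pipelines around the trailing-matrix update, and the block size nb, the helper name blocked_lu, and the build line are assumptions made for the example.

/*
 * Illustrative sketch only: right-looking blocked LU with partial pivoting
 * (panel factorization -> row interchanges -> triangular solve -> trailing
 * update). Single node, single address space; the paper's coprocessor-resident
 * scheme additionally overlaps MPI communication and PCIe transfers with the
 * trailing-matrix GEMM, which is not shown here.
 *
 * Assumed build: link against any CBLAS/LAPACKE implementation, e.g.
 *   cc lu_blocked.c -llapacke -lcblas -lm
 */
#include <stdio.h>
#include <stdlib.h>
#include <lapacke.h>
#include <cblas.h>

/* Factor an n-by-n column-major matrix A in place with block size nb. */
static int blocked_lu(int n, int nb, double *A, int lda, lapack_int *ipiv)
{
    for (int k = 0; k < n; k += nb) {
        int jb = (n - k < nb) ? (n - k) : nb;

        /* 1. Panel factorization of A(k:n-1, k:k+jb-1) with partial pivoting. */
        int info = LAPACKE_dgetrf(LAPACK_COL_MAJOR, n - k, jb,
                                  &A[k + (size_t)k * lda], lda, &ipiv[k]);
        if (info != 0) return info;

        /* Shift the panel's local 1-based pivot indices to global row numbers,
           then apply the interchanges to the columns left and right of the panel. */
        for (int i = k; i < k + jb; i++) ipiv[i] += k;
        LAPACKE_dlaswp(LAPACK_COL_MAJOR, k, A, lda, k + 1, k + jb, ipiv, 1);
        if (k + jb < n) {
            LAPACKE_dlaswp(LAPACK_COL_MAJOR, n - k - jb,
                           &A[(size_t)(k + jb) * lda], lda,
                           k + 1, k + jb, ipiv, 1);

            /* 2. Triangular solve for the block row of U:
                  U(k, k+jb:) = L(k,k)^{-1} * A(k, k+jb:). */
            cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower, CblasNoTrans,
                        CblasUnit, jb, n - k - jb, 1.0,
                        &A[k + (size_t)k * lda], lda,
                        &A[k + (size_t)(k + jb) * lda], lda);

            /* 3. Trailing-matrix update, the dominant GEMM that the paper keeps
                  resident on the coprocessor and overlaps with communication. */
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                        n - k - jb, n - k - jb, jb, -1.0,
                        &A[(k + jb) + (size_t)k * lda], lda,
                        &A[k + (size_t)(k + jb) * lda], lda, 1.0,
                        &A[(k + jb) + (size_t)(k + jb) * lda], lda);
        }
    }
    return 0;
}

int main(void)
{
    int n = 512, nb = 64;                       /* example sizes only */
    double *A = malloc((size_t)n * n * sizeof *A);
    lapack_int *ipiv = malloc((size_t)n * sizeof *ipiv);
    srand(0);
    for (size_t i = 0; i < (size_t)n * n; i++)
        A[i] = (double)rand() / RAND_MAX - 0.5;  /* random test matrix */
    int info = blocked_lu(n, nb, A, n, ipiv);
    printf("blocked_lu returned %d\n", info);
    free(A); free(ipiv);
    return info;
}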



Acknowledgements

This work is supported by the National High Technology R&D Program of China (863 Program) under Grant 2015AA01A301, and by the National Natural Science Foundation of China (NSFC) under Grants 61402488 and 61602501.

Author information

Corresponding author

Correspondence to Cheng Chen.


About this article

Cite this article

Chen, C., Fang, J., Tang, T. et al. LU factorization on heterogeneous systems: an energy-efficient approach towards high performance. Computing 99, 791–811 (2017). https://doi.org/10.1007/s00607-016-0537-2

