Abstract
Numerous challenges must be mastered when developing scientific computing applications for post-petascale parallel systems. While ample parallelism is usually available in the numerical problems at hand, the efficient use of supercomputer resources requires not only good scalability but also a verifiably effective use of resources at the core, processor, and accelerator levels. Furthermore, power dissipation and energy consumption are becoming additional optimization targets besides time-to-solution. Performance Engineering (PE) is the pivotal strategy for developing effective parallel code on all levels of modern architectures. In this paper we report on the development and use of low-level parallel building blocks in the GHOST library (“General, Hybrid, and Optimized Sparse Toolkit”). We demonstrate the use of PE in optimizing a density-of-states computation based on the Kernel Polynomial Method, and show that reducing runtime and reducing energy consumption are literally the same goal in this case. We also give a brief overview of the capabilities of GHOST and of the applications in which it is used successfully.
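The Kernel Polynomial Method mentioned above expands the density of states in Chebyshev polynomials; the moments are obtained from a three-term matrix-vector recurrence combined with stochastic trace estimation. The following NumPy sketch illustrates the core recurrence on a toy dense matrix. It is a hypothetical illustration only, not GHOST's optimized heterogeneous implementation; the matrix size, moment count, number of random vectors, and rescaling strategy are arbitrary choices made for the example.

```python
import numpy as np

# Toy KPM moment computation (illustrative parameters, dense matrix).
rng = np.random.default_rng(42)
n, n_moments, n_vectors = 200, 64, 10

# Random symmetric "Hamiltonian", rescaled so its spectrum lies in [-1, 1],
# as required by the Chebyshev expansion.
A = rng.standard_normal((n, n))
H = (A + A.T) / 2.0
H /= 1.01 * np.abs(np.linalg.eigvalsh(H)).max()

moments = np.zeros(n_moments)
for _ in range(n_vectors):                 # stochastic trace estimation
    r = rng.choice([-1.0, 1.0], size=n)    # random Rademacher vector |r>
    v0 = r.copy()                          # T_0(H)|r> = |r>
    v1 = H @ r                             # T_1(H)|r> = H|r>
    moments[0] += r @ v0
    moments[1] += r @ v1
    for m in range(2, n_moments):
        v2 = 2.0 * (H @ v1) - v0           # T_m(H)|r> via three-term recurrence
        moments[m] += r @ v2
        v0, v1 = v1, v2
moments /= n_vectors * n                   # normalized moments mu_m
```

After damping the moments with a suitable kernel (e.g., the Jackson kernel) and applying a Chebyshev transform, one obtains the density of states; in the paper, the dominant cost is the repeated sparse matrix-vector product inside the recurrence, which is exactly the kind of building block GHOST optimizes.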
Notes
- 6. The term “MPI+X” denotes the combination of the Message Passing Interface (MPI) and a node-level programming model.
- 8. http://www.mcs.anl.gov/petsc/features/gpus.html, accessed 02-16-2016
Acknowledgements
The research reported here was funded by the Deutsche Forschungsgemeinschaft via priority program 1648 “Software for Exascale Computing” (SPPEXA). The authors gratefully acknowledge the Gauss Centre for Supercomputing e.V. (GCS) for providing computing time on the SuperMUC system at Leibniz Supercomputing Centre through project pr84pi, and CSCS Lugano for providing access to the Piz Daint supercomputer. Work at Los Alamos is performed under the auspices of the USDOE.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Kreutzer, M. et al. (2016). Performance Engineering and Energy Efficiency of Building Blocks for Large, Sparse Eigenvalue Computations on Heterogeneous Supercomputers. In: Bungartz, HJ., Neumann, P., Nagel, W. (eds) Software for Exascale Computing - SPPEXA 2013-2015. Lecture Notes in Computational Science and Engineering, vol 113. Springer, Cham. https://doi.org/10.1007/978-3-319-40528-5_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-40526-1
Online ISBN: 978-3-319-40528-5
eBook Packages: Mathematics and Statistics