
Integrating software and hardware hierarchies in an autotuning method for parallel routines in heterogeneous clusters

The Journal of Supercomputing

Abstract

A hierarchical approach for autotuning linear algebra routines on heterogeneous platforms is presented. The hierarchy helps alleviate the difficulty of tuning parallel routines for high-performance computing systems. This paper analyzes the application of the hierarchical approach at both the hardware and software levels, using basic matrix multiplication and Strassen multiplication as proofs of concept on multicore+coprocessor nodes. In this way, the hierarchical approach allows the efficient exploitation of the computing units in the node to be partially delegated to the underlying, directly autotuned matrix multiplication used in the base case.
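The software hierarchy described above can be illustrated with a minimal sketch of Strassen multiplication whose recursion bottoms out in a delegated base-case routine. The function names (`strassen`, `base_multiply`) and the `threshold` parameter are illustrative assumptions, not the paper's actual interface; the plain NumPy product stands in for the node-level autotuned multiplication.

```python
import numpy as np

def base_multiply(A, B):
    # Stand-in for the autotuned base-case multiplication on the node
    # (e.g. a multicore+coprocessor kernel); here a plain NumPy product.
    return A @ B

def strassen(A, B, threshold=64):
    """Strassen multiplication of square matrices whose size is a power
    of two; below `threshold`, delegate to the base-case routine."""
    n = A.shape[0]
    if n <= threshold:
        return base_multiply(A, B)
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # The seven Strassen products, each a smaller (recursive) multiplication.
    M1 = strassen(A11 + A22, B11 + B22, threshold)
    M2 = strassen(A21 + A22, B11, threshold)
    M3 = strassen(A11, B12 - B22, threshold)
    M4 = strassen(A22, B21 - B11, threshold)
    M5 = strassen(A11 + A12, B22, threshold)
    M6 = strassen(A21 - A11, B11 + B12, threshold)
    M7 = strassen(A12 - A22, B21 + B22, threshold)
    # Recombine the products into the four quadrants of the result.
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

In the hierarchical autotuning scheme, the recursion level at which `base_multiply` takes over becomes a tunable parameter: it decides how much of the computation is delegated to the directly autotuned routine.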



Acknowledgements

This work was supported by the Spanish MCIU and AEI, as well as European Commission FEDER funds, under Grant RTI2018-098156-B-C53.

Corresponding author

Correspondence to Javier Cuenca.


Cite this article

Cámara, J., Cuenca, J. & Giménez, D. Integrating software and hardware hierarchies in an autotuning method for parallel routines in heterogeneous clusters. J Supercomput 76, 9922–9941 (2020). https://doi.org/10.1007/s11227-020-03235-9
