Abstract
This work presents several self-optimization strategies to improve the performance of task-based linear algebra software on heterogeneous systems. The study focuses on Chameleon, a task-based dense linear algebra library whose routines follow a tile-based algorithmic scheme and are executed on the available computing resources of the system by a runtime scheduler that dynamically handles the data dependencies among the basic computational kernels of each routine. The proposed strategies select the best values for the parameters that affect the performance of the routines, such as the tile size and the scheduling policy. In addition, optimized parallel implementations provided by existing linear algebra libraries, such as Intel MKL (on multicore CPU) and cuBLAS (on GPU), are used to execute each computational kernel. Results obtained on a heterogeneous system composed of multicore CPUs and several GPUs are satisfactory, with performance close to the experimental optimum.
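The core idea described above — empirically selecting the tile size that minimizes measured execution time — can be illustrated with a minimal sketch. This is not Chameleon's actual API; `blocked_matmul` and `autotune_tile_size` are hypothetical names, and a pure-Python tiled matrix multiplication stands in for the library's optimized kernels, under the assumption that an exhaustive timed search over a small candidate set is used.

```python
import time


def blocked_matmul(A, B, n, nb):
    """Multiply two n x n matrices (lists of lists) using a tile size nb."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, nb):            # iterate over tiles of C rows
        for kk in range(0, n, nb):        # tiles along the shared dimension
            for jj in range(0, n, nb):    # tiles of C columns
                for i in range(ii, min(ii + nb, n)):
                    Ai, Ci = A[i], C[i]
                    for k in range(kk, min(kk + nb, n)):
                        a, Bk = Ai[k], B[k]
                        for j in range(jj, min(jj + nb, n)):
                            Ci[j] += a * Bk[j]
    return C


def autotune_tile_size(n, candidates):
    """Empirically select the tile size with the lowest measured runtime."""
    A = [[float(i + j) for j in range(n)] for i in range(n)]
    B = [[float(i - j) for j in range(n)] for i in range(n)]
    best_nb, best_time = None, float("inf")
    for nb in candidates:
        t0 = time.perf_counter()
        blocked_matmul(A, B, n, nb)
        elapsed = time.perf_counter() - t0
        if elapsed < best_time:
            best_nb, best_time = nb, elapsed
    return best_nb, best_time


if __name__ == "__main__":
    nb, t = autotune_tile_size(96, [8, 16, 32, 48])
    print(f"best tile size: {nb} ({t:.4f} s)")
```

In a real task-based setting the same search loop would also sweep the scheduling policy, and each timed run would submit the tiled tasks to the runtime rather than execute them sequentially.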
Acknowledgements
Grant RTI2018-098156-B-C53 funded by MCIN/AEI and by “ERDF A way of making Europe”.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Cámara, J., Cuenca, J., Boratto, M. (2023). Improving the Performance of Task-Based Linear Algebra Software with Autotuning Techniques on Heterogeneous Architectures. In: Mikyška, J., de Mulatier, C., Paszynski, M., Krzhizhanovskaya, V.V., Dongarra, J.J., Sloot, P.M. (eds) Computational Science – ICCS 2023. ICCS 2023. Lecture Notes in Computer Science, vol 14073. Springer, Cham. https://doi.org/10.1007/978-3-031-35995-8_47
Print ISBN: 978-3-031-35994-1
Online ISBN: 978-3-031-35995-8