Algorithm and Software Overhead: A Theoretical Approach to Performance Portability

  • Conference paper
  • In: Parallel Processing and Applied Mathematics (PPAM 2022)

Abstract

In recent years, the term "portability" has taken on new meanings: research communities are discussing how to measure the degree to which an application (or library, programming model, algorithm implementation, etc.) is "performance portable". The term "performance portability" has been used informally in computing communities to refer, essentially, to: (1) the ability to run one application across multiple hardware platforms; and (2) achieving some decent level of performance on these platforms [1, 2]. Among the efforts related to the performance portability issue, we note the annual performance portability workshops organized by the US Department of Energy [3]. This article adds a new point of view to the performance portability issue, starting from a more theoretical standpoint: it shows the convenience of separating the algorithm proper from the "overhead", and it explores the different factors that introduce different kinds of overhead. The paper develops a theoretical framework that leads to a definition of the execution time of a program, but that definition is not the point. The aim is to show and understand the link between that execution time and the very beginning of the design process, in order to identify which part of a program is really environment-sensitive and to exclude from performance portability formulas everything that, as theoretically shown, is not going to change.
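For context, the metric proposed by Pennycook et al. [1] makes this informal notion measurable: it scores an application a, solving a problem p across a set of platforms H, by the harmonic mean of a per-platform performance efficiency, and it scores zero if the application does not run on every platform in H. Below is a minimal sketch of that metric in Python; the function name and the efficiency values are illustrative assumptions, not taken from this paper.

    # Minimal sketch of the performance portability metric of [1]:
    # the harmonic mean of the per-platform performance efficiencies
    # e_i in (0, 1], or 0 if any platform is unsupported (e_i = 0).
    def performance_portability(efficiencies):
        if not efficiencies or any(e == 0.0 for e in efficiencies):
            return 0.0
        return len(efficiencies) / sum(1.0 / e for e in efficiencies)

    # Hypothetical efficiencies (fraction of peak, or of best-known,
    # performance) measured on three platforms:
    print(performance_portability([0.80, 0.65, 0.90]))  # ~0.77
    print(performance_portability([0.80, 0.65, 0.0]))   # 0.0

The harmonic mean penalizes a low efficiency on any single platform, which is why an application that is portable but slow somewhere scores poorly under this metric.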


Notes

  1. Decomposition matrix is the name we preferred in this work, but in [8] it is referred to as the dependency matrix.

  2. These can be basic operations (arithmetic, \(\ldots \)), evaluations of special functions (\(\sin ,\cos ,\ldots \)), or solvers (integrals, systems of equations, non-linear equations, \(\ldots \)).

  3. For the general case, see [12].

  4. This assumption is necessary to compare two algorithms.

  5. Scale Up is defined in [8] as the ratio \(SC(D_{k_i},D_{k_j}):=\frac{k_i}{k_j}\), and it measures the difference between the two algorithms with respect to the number of operations they perform to solve the same problem (see the sketch after these notes).

  6. This is an initial simplifying assumption; it is not realistic.

  7. There is no loss of generality because any operator can be rewritten as a number of elementary operators with execution time \(t_{calc}\).

  8. This is a simplified and very general logical description of the behavior of a memory hierarchy, sufficient for the aims of the framework. Of course it could be adapted to an actual architecture, but the following definitions would still hold.

  9. Level 0 is the fastest one.

  10. In general \(c_{AM}\le nd\), but we can assume \(c_{AM}=nd\) without loss of generality.

  11. On average.

  12. On average.

  13. For example, in the case of an algorithm like the one in [13], where the architecture is a heterogeneous GPU- and multicore-based system, we can build different matrices for different parts of the algorithm.
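As a concrete illustration of the Scale Up ratio defined in note 5, the sketch below compares two decompositions of the same problem by their operation counts; the decomposition sizes are hypothetical values chosen for illustration, not taken from [8].

    # Sketch of the Scale Up ratio SC(D_ki, D_kj) := ki / kj (note 5).
    # A value above 1 means the first decomposition performs more
    # operations than the second to solve the same problem.
    def scale_up(k_i: int, k_j: int) -> float:
        return k_i / k_j

    # Hypothetical operation counts of two algorithms for one problem:
    print(scale_up(1_500_000_000, 1_000_000_000))  # 1.5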

References

  1. Pennycook, S.J., Sewall, J.D., Lee, V.W.: Implications of a metric for performance portability. Future Gener. Comput. Syst. 92, 947–958 (2017). https://doi.org/10.1016/j.future.2017.08.007

  2. Kwack, J., et al.: Evaluating performance portability of HPC applications and benchmarks across diverse HPC architectures. Exascale Computing Project (ECP) Webinar. https://www.exascaleproject.org/event/performance-portability-evaluation/. Accessed 20 May 2020

  3. DOE centres of excellence performance portability meeting: post-meeting report technical report LLNL-TR-700962. Lawrence Livermore National Laboratory, Livermore (2016). https://asc.llnl.gov/sites/asc/files/2020-09/COE-PP-Meeting-2016-FinalReport_0.pdf

  4. Carracciuolo, L., Mele, V., Szustak, L.: About the granularity portability of block-based Krylov methods in heterogeneous computing environments. Concurr. Comput. Pract. Exp. 33(4), e6008 (2021). https://doi.org/10.1002/cpe.6008

  5. Neely, J.R.: DOE centers of excellence performance portability meeting. Technical report LLNL-TR-700962, 4. Lawrence Livermore National Laboratory (2016). https://doi.org/10.2172/1332474

  6. Edwards, H.C., Trott, C.R., Sunderland, D.: Kokkos: enabling manycore performance portability through polymorphic memory access patterns. J. Parallel Distrib. Comput. 74(12), 3202–3216 (2014). https://doi.org/10.1016/j.jpdc.2014.07.003

  7. Pennycook, J., Sewall, J., Jacobsen, D.W., Deakin, T., McIntosh-Smith, S.N.: Navigating performance, portability and productivity. Comput. Sci. Eng. 23(5), 28–38 (2021). https://doi.org/10.1109/MCSE.2021.3097276

  8. Mele, V., Romano, D., Constantinescu, E.M., Carracciuolo, L., D’Amore, L.: Performance evaluation for a PETSc parallel-in-time solver based on the MGRIT algorithm. In: Mencagli, G., et al. (eds.) Euro-Par 2018. LNCS, vol. 11339, pp. 716–728. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-10549-5_56

  9. D’Amore, L., Mele, V., Laccetti, G., Murli, A.: Mathematical approach to the performance evaluation of matrix multiply algorithm. In: Wyrzykowski, R., Deelman, E., Dongarra, J., Karczewski, K., Kitowski, J., Wiatr, K. (eds.) PPAM 2015. LNCS, vol. 9574, pp. 25–34. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-32152-3_3

  10. Mele, V., Constantinescu, E.M., Carracciuolo, L., D'Amore, L.: A PETSc parallel-in-time solver based on MGRIT algorithm. Concurr. Comput. Pract. Exp. 30(24), e4928 (2018). https://doi.org/10.1002/cpe.4928

  11. D'Amore, L., Mele, V., Romano, D., Laccetti, G.: Multilevel algebraic approach for performance analysis of parallel algorithms. Comput. Inform. 38(4), 817–850 (2019). https://doi.org/10.31577/cai_2019_4_817

  12. Romano, D., Lapegna, M., Mele, V., Laccetti, G.: Designing a GPU-parallel algorithm for raw SAR data compression: a focus on parallel performance estimation. Future Gener. Comput. Syst. 112(6), 695–708 (2020). https://doi.org/10.1016/j.future.2020.06.027

  13. Laccetti, G., Lapegna, M., Mele, V., Romano, D.: A study on adaptive algorithms for numerical quadrature on heterogeneous GPU and multicore based systems. In: Wyrzykowski, R., Dongarra, J., Karczewski, K., Waśniewski, J. (eds.) PPAM 2013. LNCS, vol. 8384, pp. 704–713. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-642-55224-3_66

  14. Laccetti, G., Lapegna, M., Mele, V.: A loosely coordinated model for heap-based priority queues in multicore environments. Int. J. Parallel Prog. 44(4), 901–921 (2015). https://doi.org/10.1007/s10766-015-0398-x

  15. Laccetti, G., Lapegna, M., Mele, V., Montella, R.: An adaptive algorithm for high-dimensional integrals on heterogeneous CPU-GPU systems. Concurr. Comput. Pract. Exp. 31(19), e4945 (2019). https://doi.org/10.1002/cpe.4945

  16. Montella, R., Giunta, G., Laccetti, G.: Virtualizing high-end GPGPUs on ARM clusters for the next generation of high performance cloud computing. Cluster Comput. 17(1), 139–152 (2014). https://doi.org/10.1007/s10586-013-0341-0

  17. Marcellino, L., et al.: Using GPGPU accelerated interpolation algorithms for marine bathymetry processing with on-premises and cloud based computational resources. In: Wyrzykowski, R., Dongarra, J., Deelman, E., Karczewski, K. (eds.) PPAM 2017. LNCS, vol. 10778, pp. 14–24. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-78054-2_2

  18. D'Amore, L., Campagna, R., Mele, V., Murli, A., Rizzardi, M.: ReLaTIve: an ANSI C90 software package for the real Laplace transform inversion. Numer. Algorithms 63(1), 187–211 (2013). https://doi.org/10.1007/s11075-012-9636-0

  19. D'Amore, L., Campagna, R., Mele, V., Murli, A.: Algorithm 946: ReLIADiff, a C++ software package for real Laplace transform inversion based on automatic differentiation. ACM Trans. Math. Softw. 40(4), Article 31, 31:1–31:20 (2014). https://doi.org/10.1145/2616971

  20. D’Amore, L., Mele, V., Campagna, R.: Quality assurance of Gaver’s formula for multi-precision Laplace transform inversion in real case. Inverse Probl. Sci. Eng. 26(4), 553–580 (2018). https://doi.org/10.1080/17415977.2017.1322963

  21. Tjaden, G.S., Flynn, M.J.: Detection and parallel execution of independent instructions. IEEE Trans. Comput. C-19(10), 889–895 (1970). https://doi.org/10.1109/T-C.1970.222795

  22. Flatt, H.P., Kennedy, K.: Performance of parallel processors. Parallel Comput. 12(1), 1–20 (1989). https://doi.org/10.1016/0167-8191(89)90003-3

  23. Maddalena, L., Petrosino, A., Laccetti, G.: A fusion-based approach to digital movie restoration. Pattern Recogn. 42(7), 1485–1495 (2009). https://doi.org/10.1016/j.patcog.2008.10.026

  24. Hockney, R.W.: The Science of Computer Benchmarking. SIAM (1996)

  25. Ballard, G., Demmel, J., Knight, N.: Avoiding communication in successive band reduction. ACM Trans. Parallel Comput. 1(2), Article 11, 37 pages (2015). https://doi.org/10.1145/2686877

  26. Koanantakool, P., et al.: Communication-avoiding parallel sparse-dense matrix-matrix multiplication. In: IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 842–853 (2016). https://doi.org/10.1109/IPDPS.2016.117

  27. Sao, P., Kannan, R., Li, X.S., Vuduc, R.: A communication-avoiding 3D sparse triangular solver. In: Proceedings of the ACM International Conference on Supercomputing (ICS 2019), pp. 127–137. Association for Computing Machinery, New York (2019). https://doi.org/10.1145/3330345.3330357

  28. Kennedy, K., McKinley, K.S.: Optimizing for parallelism and data locality. In: Proceedings of the 6th International Conference on Supercomputing (ICS 1992), pp. 323–334. Association for Computing Machinery, New York (1992). https://doi.org/10.1145/143369.143427

Author information

Corresponding author

Correspondence to Valeria Mele.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Mele, V., Laccetti, G. (2023). Algorithm and Software Overhead: A Theoretical Approach to Performance Portability. In: Wyrzykowski, R., Dongarra, J., Deelman, E., Karczewski, K. (eds) Parallel Processing and Applied Mathematics. PPAM 2022. Lecture Notes in Computer Science, vol 13827. Springer, Cham. https://doi.org/10.1007/978-3-031-30445-3_8

  • DOI: https://doi.org/10.1007/978-3-031-30445-3_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-30444-6

  • Online ISBN: 978-3-031-30445-3

  • eBook Packages: Computer Science, Computer Science (R0)
