Algorithm and Software Overhead: A Theoretical Approach to Performance Portability

  • Conference paper
  • In: Parallel Processing and Applied Mathematics (PPAM 2022)

Abstract

In recent years, the term "portability" has taken on new meanings: research communities are discussing how to measure the degree to which an application (or library, programming model, algorithm implementation, etc.) is "performance portable". The term "performance portability" has been used informally in computing communities to refer, essentially, to: (1) the ability to run one application across multiple hardware platforms; and (2) achieving some decent level of performance on these platforms [1, 2]. Among the efforts related to the performance portability issue, we note the annual performance portability workshops organized by the US Department of Energy [3]. This article adds a new point of view to the performance portability issue, starting from a more theoretical standpoint: it shows the convenience of separating the algorithm proper from the "overhead", and it explores the different factors that introduce different kinds of overhead. The paper develops a theoretical framework that leads to a definition of the execution time of a program, but that definition is not the point. The aim is to show and understand the link between that execution time and the very beginning of the design process, in order to identify which part of a program is really environment-sensitive and to exclude from performance portability formulas everything that, as theoretically shown, is not going to change.
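For context, the metric proposed by Pennycook et al. [1] makes this informal notion measurable: it scores an application a, solving a problem p across a set of platforms H, by the harmonic mean of a per-platform performance efficiency, and it scores zero if the application does not run on every platform in H. Below is a minimal sketch of that metric in Python; the function name and the efficiency values are illustrative assumptions, not taken from this paper.

    # Minimal sketch of the performance portability metric of [1]:
    # the harmonic mean of the per-platform performance efficiencies
    # e_i in (0, 1], or 0 if any platform is unsupported (e_i = 0).
    def performance_portability(efficiencies):
        if not efficiencies or any(e == 0.0 for e in efficiencies):
            return 0.0
        return len(efficiencies) / sum(1.0 / e for e in efficiencies)

    # Hypothetical efficiencies (fraction of peak, or of best-known,
    # performance) measured on three platforms:
    print(performance_portability([0.80, 0.65, 0.90]))  # ~0.77
    print(performance_portability([0.80, 0.65, 0.0]))   # 0.0

The harmonic mean penalizes a low efficiency on any single platform, which is why an application that is portable but slow somewhere scores poorly under this metric.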


Notes

  1. Decomposition matrix is the name we preferred in this work, but in [8] it is referred to as the dependency matrix.

  2. These can be basic operations (arithmetic, \(\ldots \)), evaluations of special functions (\(\sin ,\cos ,\ldots \)), or solvers (integrals, systems of equations, non-linear equations, \(\ldots \)).

  3. For the general case, see [12].

  4. This assumption is necessary to compare two algorithms.

  5. Scale Up is defined in [8] as the ratio \(SC(D_{k_i},D_{k_j}):=\frac{k_i}{k_j}\), and it measures the difference between the two algorithms with respect to the number of operations they perform to solve the same problem (see the sketch after these notes).

  6. This is an initial simplifying assumption; it is not realistic.

  7. There is no loss of generality because any operator can be rewritten as a number of elementary operators with execution time \(t_{calc}\).

  8. This is a simplified and very general logical description of the behavior of a memory hierarchy, sufficient for the aims of the framework. Of course it could be adapted to an actual architecture, but the following definitions would still hold.

  9. Level 0 is the fastest one.

  10. In general \(c_{AM}\le nd\), but we can assume \(c_{AM}=nd\) without loss of generality.

  11. On average.

  12. On average.

  13. For example, in the case of an algorithm like the one in [13], where the architecture is a heterogeneous GPU- and multicore-based system, we can build different matrices for different parts of the algorithm.
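As a concrete illustration of the Scale Up ratio defined in note 5, the sketch below compares two decompositions of the same problem by their operation counts; the decomposition sizes are hypothetical values chosen for illustration, not taken from [8].

    # Sketch of the Scale Up ratio SC(D_ki, D_kj) := ki / kj (note 5).
    # A value above 1 means the first decomposition performs more
    # operations than the second to solve the same problem.
    def scale_up(k_i: int, k_j: int) -> float:
        return k_i / k_j

    # Hypothetical operation counts of two algorithms for one problem:
    print(scale_up(1_500_000_000, 1_000_000_000))  # 1.5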

References

  1. Pennycook, S.J., Sewall, J.D., Lee, V.W.: Implications of a metric for performance portability. Future Gener. Comput. Syst. 92, 947–958 (2017). https://doi.org/10.1016/j.future.2017.08.007

  2. Kwack, J., et al.: Evaluating performance portability of HPC applications and benchmarks across diverse HPC architectures. Exascale Computing Project (ECP) Webinar. https://www.exascaleproject.org/event/performance-portability-evaluation/. Accessed 20 May 2020

  3. DOE centres of excellence performance portability meeting: post-meeting report technical report LLNL-TR-700962. Lawrence Livermore National Laboratory, Livermore (2016). https://asc.llnl.gov/sites/asc/files/2020-09/COE-PP-Meeting-2016-FinalReport_0.pdf

  4. Carracciuolo, L., Mele, V., Szustak, L.: About the granularity portability of block-based Krylov methods in heterogeneous computing environments. Concurr. Comput. Pract. Exp. 33(4), e6008 (2021). https://doi.org/10.1002/cpe.6008

  5. Neely, J.R.: DOE centers of excellence performance portability meeting. Technical report LLNL-TR-700962, 4. Lawrence Livermore National Laboratory (2016). https://doi.org/10.2172/1332474

  6. Edwards, H.C., Trott, C.R., Sunderland, D.: Kokkos: enabling manycore performance portability through polymorphic memory access patterns. J. Parallel Distrib. Comput. 74(12), 3202–3216 (2014). https://doi.org/10.1016/j.jpdc.2014.07.003

  7. Pennycook, J., Sewall, J., Jacobsen, D.W., Deakin, T., McIntosh-Smith, S.N.: Navigating performance, portability and productivity. Comput. Sci. Eng. 23(5), 28–38 (2021). https://doi.org/10.1109/MCSE.2021.3097276

  8. Mele, V., Romano, D., Constantinescu, E.M., Carracciuolo, L., D’Amore, L.: Performance evaluation for a PETSc parallel-in-time solver based on the MGRIT algorithm. In: Mencagli, G., et al. (eds.) Euro-Par 2018. LNCS, vol. 11339, pp. 716–728. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-10549-5_56

  9. D’Amore, L., Mele, V., Laccetti, G., Murli, A.: Mathematical approach to the performance evaluation of matrix multiply algorithm. In: Wyrzykowski, R., Deelman, E., Dongarra, J., Karczewski, K., Kitowski, J., Wiatr, K. (eds.) PPAM 2015. LNCS, vol. 9574, pp. 25–34. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-32152-3_3

  10. Mele, V., Constantinescu, E.M., Carracciuolo, L., D'Amore, L.: A PETSc parallel-in-time solver based on MGRIT algorithm. Concurr. Comput. Pract. Exp. 30(24), e4928 (2018). https://doi.org/10.1002/cpe.4928

  11. D'Amore, L., Mele, V., Romano, D., Laccetti, G.: Multilevel algebraic approach for performance analysis of parallel algorithms. Comput. Inform. 38(4), 817–850 (2019). https://doi.org/10.31577/cai_2019_4_817

  12. Romano, D., Lapegna, M., Mele, V., Laccetti, G.: Designing a GPU-parallel algorithm for raw SAR data compression: a focus on parallel performance estimation. Future Gener. Comput. Syst. 112(6), 695–708 (2020). https://doi.org/10.1016/j.future.2020.06.027

  13. Laccetti, G., Lapegna, M., Mele, V., Romano, D.: A study on adaptive algorithms for numerical quadrature on heterogeneous GPU and multicore based systems. In: Wyrzykowski, R., Dongarra, J., Karczewski, K., Waśniewski, J. (eds.) PPAM 2013. LNCS, vol. 8384, pp. 704–713. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-642-55224-3_66

  14. Laccetti, G., Lapegna, M., Mele, V.: A loosely coordinated model for heap-based priority queues in multicore environments. Int. J. Parallel Prog. 44(4), 901–921 (2015). https://doi.org/10.1007/s10766-015-0398-x

  15. Laccetti, G., Lapegna, M., Mele, V., Montella, R.: An adaptive algorithm for high-dimensional integrals on heterogeneous CPU-GPU systems. Concurr. Comput. Pract. Exp. 31(19), e4945 (2019). https://doi.org/10.1002/cpe.4945

  16. Montella, R., Giunta, G., Laccetti, G.: Virtualizing high-end GPGPUs on ARM clusters for the next generation of high performance cloud computing. Cluster Comput. 17(1), 139–152 (2014). https://doi.org/10.1007/s10586-013-0341-0

  17. Marcellino, L., et al.: Using GPGPU accelerated interpolation algorithms for marine bathymetry processing with on-premises and cloud based computational resources. In: Wyrzykowski, R., Dongarra, J., Deelman, E., Karczewski, K. (eds.) PPAM 2017. LNCS, vol. 10778, pp. 14–24. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-78054-2_2

  18. D'Amore, L., Campagna, R., Mele, V., Murli, A., Rizzardi, M.: ReLaTIve: an ANSI C90 software package for the real Laplace transform inversion. Numer. Algorithms 63(1), 187–211 (2013). https://doi.org/10.1007/s11075-012-9636-0

  19. D'Amore, L., Campagna, R., Mele, V., Murli, A.: Algorithm 946: ReLIADiff, a C++ software package for real Laplace transform inversion based on automatic differentiation. ACM Trans. Math. Softw. 40(4), Article 31, 31:1–31:20 (2014). https://doi.org/10.1145/2616971

  20. D’Amore, L., Mele, V., Campagna, R.: Quality assurance of Gaver’s formula for multi-precision Laplace transform inversion in real case. Inverse Probl. Sci. Eng. 26(4), 553–580 (2018). https://doi.org/10.1080/17415977.2017.1322963

  21. Tjaden, G.S., Flynn, M.J.: Detection and parallel execution of independent instructions. IEEE Trans. Comput. C-19(10), 889–895 (1970). https://doi.org/10.1109/T-C.1970.222795

  22. Flatt, H.P., Kennedy, K.: Performance of parallel processors. Parallel Comput. 12(1), 1–20 (1989). https://doi.org/10.1016/0167-8191(89)90003-3

  23. Maddalena, L., Petrosino, A., Laccetti, G.: A fusion-based approach to digital movie restoration. Pattern Recogn. 42(7), 1485–1495 (2009). https://doi.org/10.1016/j.patcog.2008.10.026

  24. Hockney, R.W.: The Science of Computer Benchmarking. SIAM (1996)

  25. Ballard, G., Demmel, J., Knight, N.: Avoiding communication in successive band reduction. ACM Trans. Parallel Comput. 1(2), Article 11, 37 pages (2015). https://doi.org/10.1145/2686877

  26. Koanantakool, P., et al.: Communication-avoiding parallel sparse-dense matrix-matrix multiplication. In: IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 842–853 (2016). https://doi.org/10.1109/IPDPS.2016.117

  27. Sao, P., Kannan, R., Li, X.S., Vuduc, R.: A communication-avoiding 3D sparse triangular solver. In: Proceedings of the ACM International Conference on Supercomputing (ICS 2019), pp. 127–137. Association for Computing Machinery, New York (2019). https://doi.org/10.1145/3330345.3330357

  28. Kennedy, K., McKinley, K.S.: Optimizing for parallelism and data locality. In: Proceedings of the 6th International Conference on Supercomputing (ICS 1992), pp. 323–334. Association for Computing Machinery, New York (1992). https://doi.org/10.1145/143369.143427

Author information

Corresponding author

Correspondence to Valeria Mele.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Mele, V., Laccetti, G. (2023). Algorithm and Software Overhead: A Theoretical Approach to Performance Portability. In: Wyrzykowski, R., Dongarra, J., Deelman, E., Karczewski, K. (eds) Parallel Processing and Applied Mathematics. PPAM 2022. Lecture Notes in Computer Science, vol 13827. Springer, Cham. https://doi.org/10.1007/978-3-031-30445-3_8

  • DOI: https://doi.org/10.1007/978-3-031-30445-3_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-30444-6

  • Online ISBN: 978-3-031-30445-3

  • eBook Packages: Computer Science, Computer Science (R0)
