Abstract
Specialized hardware is increasingly deployed in clusters to improve compute performance and energy efficiency. Porting and tuning scientific codes for these heterogeneous clusters requires significant development effort. To mitigate this effort while maintaining high performance, modern parallel programming models introduce a second layer of abstraction, in which an architecture-agnostic source code is maintained and automatically optimized for the target architecture. However, with increasing heterogeneity, mapping an application to a specific architecture itself becomes a complex decision that requires a differentiated consideration of processor features and algorithmic properties. Furthermore, architecture-agnostic global transformations are necessary to maximize the simultaneous utilization of different processors. We therefore introduce a combinatorial optimization approach that globally transforms parallel algorithms and automatically maps them to heterogeneous architectures. We derive a global transformation and mapping algorithm based on a static performance model. Moreover, we demonstrate the approach on five typical algorithmic kernels, showing automatic global transformations such as loop fusion, reordering, pipelining, and NUMA awareness, as well as optimal mapping strategies for an exemplary CPU-GPU compute node. Our algorithm achieves performance on par with hand-tuned implementations of all five kernels.
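The abstract does not reproduce the mapping algorithm itself, but the idea of driving mapping decisions with a static performance model can be illustrated with a minimal sketch. The C++ program below uses a roofline-style estimate (runtime bounded by either peak compute throughput or memory bandwidth) and greedily assigns each kernel to the device with the lowest predicted runtime. All device numbers, struct names, and the greedy per-kernel choice are illustrative assumptions, not the paper's actual model or API: the paper's approach optimizes globally and additionally applies transformations such as loop fusion and pipelining.

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Illustrative device description: peak compute (FLOP/s) and
// memory bandwidth (byte/s), as used by roofline-style models.
struct Device {
    std::string name;
    double peak_flops;      // FLOP/s
    double peak_bandwidth;  // byte/s
};

// Illustrative kernel description: total work and memory traffic.
struct Kernel {
    std::string name;
    double flops;  // total floating-point operations
    double bytes;  // total bytes moved to/from memory
};

// Static performance estimate: the kernel is either compute- or
// memory-bound, so its runtime is the larger of the two lower bounds.
double estimate_runtime(const Kernel& k, const Device& d) {
    return std::max(k.flops / d.peak_flops, k.bytes / d.peak_bandwidth);
}

int main() {
    std::vector<Device> devices = {
        {"CPU", 1.0e12, 2.0e11},  // hypothetical 1 TFLOP/s, 200 GB/s
        {"GPU", 1.0e13, 9.0e11},  // hypothetical 10 TFLOP/s, 900 GB/s
    };
    std::vector<Kernel> kernels = {
        {"stencil", 1.0e9, 8.0e9},   // low arithmetic intensity
        {"matmul",  1.0e12, 1.0e9},  // high arithmetic intensity
    };

    // Greedy mapping: assign each kernel to the device with the
    // lowest predicted runtime (ignores transfers and overlap).
    for (const Kernel& k : kernels) {
        const Device* best = &devices[0];
        for (const Device& d : devices)
            if (estimate_runtime(k, d) < estimate_runtime(k, *best))
                best = &d;
        std::cout << k.name << " -> " << best->name << " ("
                  << estimate_runtime(k, *best) << " s)\n";
    }
}

A realistic mapper must also account for host-device data transfers and for the benefit of running kernels on both devices concurrently, which is precisely what makes the problem combinatorial rather than a per-kernel choice.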
Acknowledgement
We thank Adrian Schmitz (RWTH Aachen University) for implementing the prototype language PPL and for useful discussions. We also thank Huddly AS for supporting Lukas Trümper in this work.
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Trümper, L., Miller, J., Terboven, C., Müller, M.S. (2021). Automatic Mapping of Parallel Pattern-Based Algorithms on Heterogeneous Architectures. In: Hochberger, C., Bauer, L., Pionteck, T. (eds.) Architecture of Computing Systems. ARCS 2021. Lecture Notes in Computer Science, vol. 12800. Springer, Cham. https://doi.org/10.1007/978-3-030-81682-7_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-81681-0
Online ISBN: 978-3-030-81682-7
eBook Packages: Computer Science, Computer Science (R0)