
Automatic Mapping of Parallel Pattern-Based Algorithms on Heterogeneous Architectures

  • Conference paper
  • Published in: Architecture of Computing Systems (ARCS 2021)

Abstract

Nowadays, specialized hardware is often found in clusters to improve compute performance and energy efficiency. The porting and tuning of scientific codes to these heterogeneous clusters require significant development effort. To mitigate this effort while maintaining high performance, modern parallel programming models introduce a second layer of abstraction, where an architecture-agnostic source code can be maintained and automatically optimized for the target architecture. However, with increasing heterogeneity, the mapping of an application to a specific architecture itself becomes a complex decision requiring a differentiated consideration of processor features and algorithmic properties. Furthermore, architecture-agnostic global transformations are necessary to maximize the simultaneous utilization of different processors. Therefore, we introduce a combinatorial optimization approach to globally transform and automatically map parallel algorithms to heterogeneous architectures. We derive a global transformation and mapping algorithm that is based on a static performance model. Moreover, we demonstrate the approach on five typical algorithmic kernels, showing automatic and global transformations such as loop fusion, re-ordering, pipelining, NUMA awareness, and optimal mapping strategies to an exemplary CPU-GPU compute node. Our algorithm achieves performance on par with hand-tuned implementations of all five kernels.
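The combinatorial mapping problem sketched in the abstract can be illustrated with a toy exhaustive search: given a static performance model (predicted per-kernel runtimes on each processor), enumerate all kernel-to-device assignments and choose the one with the smallest predicted makespan when CPU and GPU execute their assigned kernels concurrently. All kernel names, cost numbers, and the transfer penalty below are illustrative assumptions for this sketch; they are not the paper's actual model or algorithm.

```python
from itertools import product

# Hypothetical static cost model: predicted runtime (ms) of each kernel
# on each processor. Numbers are illustrative, not taken from the paper.
cost = {
    "stencil":   {"cpu": 40.0, "gpu": 8.0},
    "reduction": {"cpu": 10.0, "gpu": 6.0},
    "scan":      {"cpu": 12.0, "gpu": 15.0},
}
TRANSFER = 3.0  # flat host<->device transfer penalty per GPU-mapped kernel


def makespan(assignment):
    """Predicted makespan if CPU and GPU run their kernels concurrently."""
    totals = {"cpu": 0.0, "gpu": 0.0}
    for kernel, device in assignment.items():
        totals[device] += cost[kernel][device]
        if device == "gpu":
            totals[device] += TRANSFER
    return max(totals.values())


def best_mapping():
    """Exhaustively search all kernel-to-device assignments."""
    kernels = list(cost)
    best = None
    for devices in product(("cpu", "gpu"), repeat=len(kernels)):
        assignment = dict(zip(kernels, devices))
        span = makespan(assignment)
        if best is None or span < best[1]:
            best = (assignment, span)
    return best


mapping, span = best_mapping()
# With these costs: stencil and reduction go to the GPU, scan stays on
# the CPU, for a predicted makespan of 20.0 ms.
print(mapping, span)
```

Brute force is only viable for a handful of kernels (the search space grows as devices^kernels); a real system would rely on a combinatorial optimization or scheduling heuristic over the same kind of static cost model.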



Acknowledgement

We thank Adrian Schmitz (RWTH Aachen University) for implementing the prototype language PPL and for useful discussions. We also thank Huddly AS for supporting Lukas Trümper in this work.

Author information

Corresponding author: Julian Miller.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (zip 98 KB)


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Trümper, L., Miller, J., Terboven, C., Müller, M.S. (2021). Automatic Mapping of Parallel Pattern-Based Algorithms on Heterogeneous Architectures. In: Hochberger, C., Bauer, L., Pionteck, T. (eds) Architecture of Computing Systems. ARCS 2021. Lecture Notes in Computer Science, vol. 12800. Springer, Cham. https://doi.org/10.1007/978-3-030-81682-7_4


  • DOI: https://doi.org/10.1007/978-3-030-81682-7_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-81681-0

  • Online ISBN: 978-3-030-81682-7

  • eBook Packages: Computer Science (R0)
