Abstract
Many applications in high-performance computing are designed based on underlying performance and execution models. While these models were successfully employed in the past to balance load within and between compute nodes, modern software and hardware increasingly make performance predictability difficult if not impossible. Consequently, balancing computational load becomes much more difficult. To tackle these challenges in search of a general solution, we present a novel library for fine-grained, task-based, reactive load balancing in distributed memory based on MPI and OpenMP. With our approach, individual migratable tasks can be executed on any MPI rank. The rank that actually executes a task is determined at run time based on online performance data. We evaluate our approach under an enforced power cap and under enforced clock frequency changes for a synthetic benchmark, and show its robustness against work-induced imbalances for a realistic application. Our experiments demonstrate speedups of up to \(1.31\times\).
Notes
- 1. Migration decisions are made on each rank separately, based on per-rank load information that has been exchanged beforehand. Consequently, this step does not require any additional two-sided or collective communication.
- 2. Although we planned to conduct the tests on our new Intel Xeon Skylake processors, that partition was still being brought into production at the time of writing.
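The idea in Note 1, that every rank can reach a migration decision from its local copy of the exchanged load values, can be illustrated with a small sketch. It is not the library's actual algorithm: the pairing rule (most loaded rank with least loaded rank), the function name `local_migration_decision`, and the threshold `min_rel_imbalance` are assumptions made for illustration, and the load values are plain integers standing in for whatever per-rank load metric was exchanged.

```python
from typing import List, Optional, Tuple


def local_migration_decision(
    my_rank: int,
    loads: List[int],
    min_rel_imbalance: float = 0.05,  # hypothetical tuning parameter
) -> Optional[Tuple[int, int]]:
    """Decide locally whether this rank should offload tasks.

    `loads[r]` is the load of rank r as known from an earlier exchange.
    Returns (target_rank, num_tasks) or None. No communication happens here.
    """
    n = len(loads)
    total = sum(loads)
    if n < 2 or total == 0:
        return None

    # Skip migration if the relative imbalance (max - avg) / avg is small.
    avg = total / n
    if (max(loads) - avg) / avg < min_rel_imbalance:
        return None

    # Sort ranks by load; ties are broken by rank id so that every rank
    # derives exactly the same ordering from the same load vector.
    order = sorted(range(n), key=lambda r: (loads[r], r))
    pos = order.index(my_rank)
    partner = order[n - 1 - pos]  # i-th most loaded pairs with i-th least loaded

    if loads[my_rank] <= loads[partner]:
        return None  # this rank is a receiver (or the middle rank)

    # Send half of the difference so both ranks end up near the same load.
    num_tasks = (loads[my_rank] - loads[partner]) // 2
    return (partner, num_tasks) if num_tasks > 0 else None


if __name__ == "__main__":
    loads = [10, 2, 6, 6]
    for rank in range(len(loads)):
        print(rank, local_migration_decision(rank, loads))
```

Because every rank evaluates the same deterministic function on the same load vector, sender and receiver arrive at consistent decisions without exchanging further messages, which is the property the note describes.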
Acknowledgements
Some of the experiments were performed with computing resources granted by JARA-HPC from RWTH Aachen University under projects jara0001 and nova0027. Parts of this work were funded by the German Federal Ministry of Education and Research (BMBF) under grant numbers 01IH16004B and 01IH16004C (Project Chameleon).
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Klinkenberg, J., Samfass, P., Bader, M., Terboven, C., Müller, M.S. (2020). Reactive Task Migration for Hybrid MPI+OpenMP Applications. In: Wyrzykowski, R., Deelman, E., Dongarra, J., Karczewski, K. (eds) Parallel Processing and Applied Mathematics. PPAM 2019. Lecture Notes in Computer Science(), vol 12044. Springer, Cham. https://doi.org/10.1007/978-3-030-43222-5_6
DOI: https://doi.org/10.1007/978-3-030-43222-5_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-43221-8
Online ISBN: 978-3-030-43222-5
eBook Packages: Computer Science, Computer Science (R0)