
Reactive Task Migration for Hybrid MPI+OpenMP Applications

  • Conference paper
  • In: Parallel Processing and Applied Mathematics (PPAM 2019)

Abstract

Many applications in high-performance computing are designed based on underlying performance and execution models. While such models could be employed successfully in the past to balance load within and between compute nodes, modern software and hardware increasingly make performance prediction difficult, if not impossible. Consequently, balancing computational load becomes much harder. To tackle these challenges in search of a general solution, we present a novel library for fine-granular, task-based reactive load balancing in distributed memory based on MPI and OpenMP. With our approach, individual migratable tasks can be executed on any MPI rank; the actual executing rank is determined at run time based on online performance data. We evaluate our approach under an enforced power cap and under enforced clock-frequency changes for a synthetic benchmark, and show its robustness against work-induced imbalances in a realistic application. Our experiments demonstrate speedups of up to \(1.31\times\).
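To make the idea concrete, below is a minimal, hypothetical C++ sketch of such a migratable task: its input and output are plain serializable data, so any MPI rank could execute it, and per-rank load information is gathered at run time to drive migration decisions. All names (MigratableTask, the load exchange via task counts) are illustrative assumptions, not the library's actual API; the paper's approach uses measured online performance data rather than task counts.

```cpp
// Hypothetical sketch of the migratable-task idea from the abstract.
// Build with an MPI C++ compiler and OpenMP, e.g.:
//   mpicxx -fopenmp sketch.cpp -o sketch && mpirun -np 4 ./sketch
#include <mpi.h>
#include <omp.h>
#include <cstddef>
#include <cstdio>
#include <vector>

// A migratable task: plain-old-data input and output, so its payload
// can be shipped between ranks with ordinary MPI messages. The task
// body is identical in every rank's binary; only the data must travel.
struct MigratableTask {
    double input;
    double output;
};

static void execute(MigratableTask &t) {
    t.output = t.input * t.input;  // stand-in for the real computation
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Deliberately imbalanced initial distribution: rank 0 gets more.
    int n_local = (rank == 0) ? 8 : 4;
    std::vector<MigratableTask> tasks(n_local);
    for (int i = 0; i < n_local; ++i)
        tasks[i].input = rank * 100.0 + i;

    // Exchange per-rank load so every rank sees the global picture
    // (the paper's "online performance data" would be measured
    // execution times, not mere task counts as here).
    std::vector<int> loads(size);
    MPI_Allgather(&n_local, 1, MPI_INT, loads.data(), 1, MPI_INT,
                  MPI_COMM_WORLD);

    // A reactive runtime would now migrate task payloads from
    // overloaded to underloaded ranks; omitted here for brevity.

    // Execute whatever is local as OpenMP tasks within the rank.
    #pragma omp parallel
    #pragma omp single
    for (std::size_t i = 0; i < tasks.size(); ++i) {
        #pragma omp task shared(tasks) firstprivate(i)
        execute(tasks[i]);
    }

    std::printf("rank %d executed %zu tasks\n", rank, tasks.size());
    MPI_Finalize();
    return 0;
}
```

The migration step itself is elided; the point of the sketch is the structure that reactive migration plugs into: serializable payloads, a replicated task body, and a run-time exchange of load data.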


Notes

  1. Migration decisions are made on each rank separately, based on per-rank load information that has been exchanged beforehand. Consequently, this step does not require any additional two-sided or collective communication (see the sketch after these notes).

  2. Although we had planned to conduct the tests on our new Intel Xeon Skylake processors, that partition was still being brought into production at the time of writing.
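
Since note 1 carries the communication argument, here is a minimal sketch, assuming per-rank loads were already exchanged (e.g. via MPI_Allgather), of how every rank can reach a consistent migration decision purely locally. The pairing rule shown (the most-loaded rank donates to the least-loaded one above a 10% threshold) is an illustrative assumption, not necessarily the paper's strategy.

```cpp
// Hypothetical sketch of the decentralized decision from note 1: all
// ranks hold the same global load vector (exchanged earlier, e.g. via
// MPI_Allgather) and run the identical deterministic function, so no
// further two-sided or collective communication is needed to agree on
// who migrates tasks to whom. The pairing rule is illustrative only.
#include <algorithm>
#include <numeric>
#include <vector>

// Returns the rank this rank should offload one task to, or -1 if it
// should keep all of its tasks.
int select_migration_target(const std::vector<double> &loads, int my_rank) {
    auto max_it = std::max_element(loads.begin(), loads.end());
    auto min_it = std::min_element(loads.begin(), loads.end());
    int max_rank = static_cast<int>(max_it - loads.begin());
    int min_rank = static_cast<int>(min_it - loads.begin());
    double mean = std::accumulate(loads.begin(), loads.end(), 0.0)
                  / loads.size();
    // Only the most-loaded rank donates, and only if its load exceeds
    // the mean by more than 10%; every rank computes the same answer.
    if (my_rank == max_rank && *max_it > 1.1 * mean)
        return min_rank;
    return -1;
}
```

Because the function is deterministic and its input is identical everywhere, the least-loaded rank can equally predict that it will receive a task and pre-post the matching receive, which is why no extra coordination round is required.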


Acknowledgements

Some of the experiments were performed with computing resources granted by JARA-HPC from RWTH Aachen University under projects jara0001 and nova0027. Parts of this work were funded by the German Federal Ministry of Education and Research (BMBF) under grant numbers 01IH16004B and 01IH16004C (Project Chameleon).

Author information


Correspondence to Jannis Klinkenberg.


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Klinkenberg, J., Samfass, P., Bader, M., Terboven, C., Müller, M.S. (2020). Reactive Task Migration for Hybrid MPI+OpenMP Applications. In: Wyrzykowski, R., Deelman, E., Dongarra, J., Karczewski, K. (eds) Parallel Processing and Applied Mathematics. PPAM 2019. Lecture Notes in Computer Science, vol. 12044. Springer, Cham. https://doi.org/10.1007/978-3-030-43222-5_6


  • DOI: https://doi.org/10.1007/978-3-030-43222-5_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-43221-8

  • Online ISBN: 978-3-030-43222-5

  • eBook Packages: Computer Science, Computer Science (R0)
