OmpSs-2@Cluster: Distributed Memory Execution of Nested OpenMP-style Tasks

  • Conference paper
  • Published in: Euro-Par 2022: Parallel Processing (Euro-Par 2022)
  • Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13440)

Abstract

State-of-the-art programming approaches generally maintain a strict division between intra-node shared-memory parallelism and inter-node MPI communication. Tasking with dependencies offers a clean, dependable abstraction for a wide range of hardware and situations within a node, but research on task offloading between nodes is still relatively immature. This paper presents a flexible task offloading extension of the OmpSs-2 programming model, which inherits task ordering from a sequential version of the code and uses a common address space to avoid address translation and to simplify the use of data structures with pointers. It uses weak dependencies to enable work to be created concurrently. The program is executed in a distributed dataflow fashion, and the runtime system overlaps the construction of the distributed dependency graph with execution, enforcing dependencies, transferring data, and scheduling tasks. Asynchronous task parallelism avoids the synchronization that is often required in MPI+OpenMP tasking. Task scheduling is flexible, and data location is tracked through the dependencies. We wish to enable future work in resiliency, scalability, load balancing and malleability, and we therefore release all source code and examples as open source.
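
To make the model concrete, the sketch below shows the nested weak-dependency pattern in OmpSs-2's pragma syntax (oss task, weak dependency clauses, and [lower;size] array sections, per the public OmpSs-2 specification [8]). The block size, the compute_block kernel, and the overall program structure are illustrative assumptions, not code from the paper; the paper itself defines how OmpSs-2@Cluster decides where offloaded tasks run.

```c
#include <stddef.h>

#define N  1024   /* total elements (illustrative) */
#define BS 256    /* block size (illustrative) */

/* Placeholder kernel standing in for real per-block work. */
static void compute_block(double *block, size_t n) {
    for (size_t j = 0; j < n; j++)
        block[j] *= 2.0;
}

void process(double *data) {
    /* Outer tasks declare only *weak* dependencies: they create further
       work rather than touching the data, so the runtime can offload them
       and keep building the distributed dependency graph concurrently. */
    for (size_t i = 0; i < N; i += BS) {
        #pragma oss task weakinout(data[i;BS])
        {
            /* The nested task carries the strong dependency; the runtime
               tracks data location through it and transfers the block to
               whichever node executes the task. */
            #pragma oss task inout(data[i;BS])
            compute_block(&data[i], BS);
        }
    }
    #pragma oss taskwait  /* wait for all offloaded work to complete */
}
```

Because both levels name the same region, and the common address space makes the pointer &data[i] valid on every node, task creation can proceed without blocking, which is how graph construction overlaps execution.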

Notes

  1. Currently, nowait is available through a Nanos6 API call.

References

  1. Aguilar Mena, J., Shaaban, O., Beltran, V., Carpenter, P., Ayguade, E., Labarta Mancho, J.: Artifact and instructions to generate experimental results for the Euro-Par 2022 paper: “OmpSs-2@Cluster: Distributed memory execution of nested OpenMP-style tasks” (2022). https://doi.org/10.6084/m9.figshare.19960721

  2. Álvarez, D., Sala, K., Maroñas, M., Roca, A., Beltran, V.: Advanced synchronization techniques for task-based runtime systems, pp. 334–347. Association for Computing Machinery, New York, NY, USA (2021). https://doi.org/10.1145/3437801.3441601

  3. Augonnet, C., Aumage, O., Furmento, N., Namyst, R., Thibault, S.: StarPU-MPI: task programming over clusters of machines enhanced with accelerators. In: European MPI Users’ Group Meeting, pp. 298–299. Springer, Berlin, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33518-1_40

  4. Barcelona Supercomputing Center: MareNostrum 4 system architecture (2017). https://www.bsc.es/marenostrum/marenostrum/technical-information

  5. Barcelona Supercomputing Center: Mercurium (2021). https://pm.bsc.es/mcxx

  6. Barcelona Supercomputing Center: Nanos6 (2021). https://github.com/bsc-pm/nanos6

  7. Barcelona Supercomputing Center: OmpSs-2 releases (2021). https://github.com/bsc-pm/ompss-releases

  8. Barcelona Supercomputing Center: OmpSs-2 specification (2021). https://pm.bsc.es/ftp/ompss-2/doc/spec/

  9. Barcelona Supercomputing Center: OmpSs-2@Cluster releases (2022). https://github.com/bsc-pm/ompss-2-cluster-releases

  10. Bauer, M., Treichler, S., Slaughter, E., Aiken, A.: Legion: expressing locality and independence with logical regions. In: SC ’12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pp. 1–11 (2012). https://doi.org/10.1109/SC.2012.71

  11. Beckingsale, D.A., et al.: RAJA: portable performance for large-scale scientific applications. In: IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC) (2019). https://doi.org/10.1109/P3HPC49587.2019.00012

  12. Blumofe, R., Joerg, C., Kuszmaul, B., Leiserson, C., Randall, K., Zhou, Y.: Cilk: an efficient multithreaded runtime system. J. Parallel Distrib. Comput. 37(1), 55–69 (1996). https://doi.org/10.1006/jpdc.1996.0107

  13. Bueno, J., et al.: Productive programming of GPU clusters with OmpSs. In: IEEE 26th International Parallel and Distributed Processing Symposium (2012). https://doi.org/10.1109/IPDPS.2012.58

  14. Charles, P., et al.: X10: an object-oriented approach to non-uniform cluster computing. In: Proceedings of the 20th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, pp. 519–538. OOPSLA 2005. Association for Computing Machinery, New York, NY, USA (2005). https://doi.org/10.1145/1094811.1094852

  15. Deelman, E., et al.: Pegasus: a framework for mapping complex scientific workflows onto distributed systems. Sci. Program. 13(3), 219–237 (2005). https://doi.org/10.1155/2005/128026

  16. Faxén, K.F.: Wool user’s guide. Technical report, Swedish Institute of Computer Science (2009)


  17. Fürlinger, K., et al.: DASH: distributed data structures and parallel algorithms in a global address space. In: Software for Exascale Computing-SPPEXA 2016–2019, pp. 103–142. Springer International Publishing (2020). https://doi.org/10.1007/978-3-030-47956-5_6

  18. Hoque, R., Herault, T., Bosilca, G., Dongarra, J.: Dynamic task discovery in PaRSEC: a data-flow task-based runtime. In: Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, pp. 1–8 (2017). https://doi.org/10.1145/3148226.3148233

  19. Kaiser, H., Heller, T., Adelstein-Lelbach, B., Serio, A., Fey, D.: HPX: a task based programming model in a global address space. In: 8th International Conference on Partitioned Global Address Space Programming Models (2014). https://doi.org/10.13140/2.1.2635.5204

  20. Klinkenberg, J., Samfass, P., Bader, M., Terboven, C., Müller, M.: CHAMELEON: reactive load balancing for hybrid MPI+OpenMP task-parallel applications. J. Parallel Distrib. Comput. 138 (2019). https://doi.org/10.1016/j.jpdc.2019.12.005

  21. Leiserson, C.E.: The Cilk++ concurrency platform. J. Supercomput. 51(3), 244–257 (2010). https://doi.org/10.1007/s11227-010-0405-3


  22. Lordan, F., et al.: ServiceSs: an interoperable programming framework for the cloud. J. Grid Comput. 12(1), 67–91 (2013). https://doi.org/10.1007/s10723-013-9272-5

  23. OpenMP Architecture Review Board: OpenMP 4.0 complete specifications, July 2013


  24. Papakonstantinou, N., Zakkak, F.S., Pratikakis, P.: Hierarchical parallel dynamic dependence analysis for recursively task-parallel programs. In: 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 933–942 (2016). https://doi.org/10.1109/IPDPS.2016.53

  25. Parallel Programming Laboratory, Department of Computer Science, University of Illinois at Urbana-Champaign: Charm++ documentation. https://charm.readthedocs.io/en/latest/index.html

  26. Perez, J.M., Beltran, V., Labarta, J., Ayguadé, E.: Improving the integration of task nesting and dependencies in OpenMP. In: 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 809–818 (2017). https://doi.org/10.1109/IPDPS.2017.69

  27. Pérez, J., Badia, R.M., Labarta, J.: A dependency-aware task-based programming environment for multi-core architectures. In: Proceedings - IEEE International Conference on Cluster Computing, ICCC, pp. 142–151 (2008). https://doi.org/10.1109/CLUSTR.2008.4663765

  28. Rotaru, T., Rahn, M., Pfreundt, F.J.: MapReduce in GPI-Space. In: Euro-Par 2013: Parallel Processing Workshops, pp. 43–52. Springer, Berlin, Heidelberg (2014). https://doi.org/10.1007/978-3-642-54420-0_5

  29. Sala, K., Macià, S., Beltran, V.: Combining one-sided communications with task-based programming models. In: 2021 IEEE International Conference on Cluster Computing (CLUSTER), pp. 528–541 (2021). https://doi.org/10.1109/Cluster48925.2021.00024

  30. Sala, K., Teruel, X., Perez, J.M., Peña, A.J., Beltran, V., Labarta, J.: Integrating blocking and non-blocking MPI primitives with task-based programming models. Parallel Comput. 85, 153–166 (2019). https://doi.org/10.1016/j.parco.2018.12.008


  31. Sergent, M., Goudin, D., Thibault, S., Aumage, O.: Controlling the memory subscription of distributed applications with a task-based runtime system. In: 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 318–327 (2016). https://doi.org/10.1109/IPDPSW.2016.105

  32. Tillenius, M.: SuperGlue: a shared memory framework using data versioning for dependency-aware task-based parallelization. SIAM J. Sci. Comput. 37(6) (2015). https://doi.org/10.1137/140989716

  33. Tzenakis, G., Papatriantafyllou, A., Kesapides, J., Pratikakis, P., Vandierendonck, H., Nikolopoulos, D.S.: BDDT: Block-level dynamic dependence analysis for deterministic task-based parallelism. In: Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. PPoPP 2012, vol. 47, pp. 301–302. Association for Computing Machinery, New York, NY, USA (2012). https://doi.org/10.1145/2145816.2145864

  34. Virouleau, P., Broquedis, F., Gautier, T., Rastello, F.: Using data dependencies to improve task-based scheduling strategies on NUMA architectures. In: European Conference on Parallel Processing, pp. 531–544. Springer (2016). https://doi.org/10.1007/978-3-319-43659-3_39

  35. Zafari, A., Larsson, E., Tillenius, M.: DuctTeip: an efficient programming model for distributed task-based parallel computing. Parallel Comput. (2019). https://doi.org/10.1016/j.parco.2019.102582


Acknowledgements and Data Availability

The datasets and code generated and/or analyzed during the current study are available in the Figshare repository [1]. This research has received funding from the European Union’s Horizon 2020/EuroHPC research and innovation programme under grant agreements No. 955606 (DEEP-SEA) and No. 754337 (EuroEXA). It is supported by the Spanish State Research Agency - Ministry of Science and Innovation (contract PID2019-107255GB and Ramon y Cajal fellowship RYC2018-025628-I) and by the Generalitat de Catalunya (2017-SGR-1414).

Author information

Corresponding author: Jimmy Aguilar Mena.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Aguilar Mena, J., Shaaban, O., Beltran, V., Carpenter, P., Ayguade, E., Labarta Mancho, J. (2022). OmpSs-2@Cluster: Distributed Memory Execution of Nested OpenMP-style Tasks. In: Cano, J., Trinder, P. (eds) Euro-Par 2022: Parallel Processing. Euro-Par 2022. Lecture Notes in Computer Science, vol 13440. Springer, Cham. https://doi.org/10.1007/978-3-031-12597-3_20

  • DOI: https://doi.org/10.1007/978-3-031-12597-3_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-12596-6

  • Online ISBN: 978-3-031-12597-3

  • eBook Packages: Computer Science, Computer Science (R0)
