OmpSs-2@Cluster: Distributed Memory Execution of Nested OpenMP-style Tasks

  • Conference paper
  • Published in: Euro-Par 2022: Parallel Processing (Euro-Par 2022)
  • Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13440)

Abstract

State-of-the-art programming approaches generally maintain a strict division between intra-node shared-memory parallelism and inter-node MPI communication. Tasking with dependencies offers a clean, dependable abstraction for a wide range of hardware and situations within a node, but research on task offloading between nodes is still relatively immature. This paper presents a flexible task offloading extension of the OmpSs-2 programming model, which inherits task ordering from a sequential version of the code and uses a common address space to avoid address translation and to simplify the use of data structures with pointers. It uses weak dependencies to enable work to be created concurrently. The program is executed in a distributed dataflow fashion, and the runtime system overlaps the construction of the distributed dependency graph with execution, enforcing dependencies, transferring data, and scheduling tasks. Asynchronous task parallelism avoids the synchronization that is often required in MPI+OpenMP tasking. Task scheduling is flexible, and data location is tracked through the dependencies. We wish to enable future work in resiliency, scalability, load balancing and malleability, and we therefore release all source code and examples as open source.
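
To make the model concrete, the sketch below shows the nested weak-dependency pattern in OmpSs-2's pragma syntax (oss task, weak dependency clauses, and [lower;size] array sections, per the public OmpSs-2 specification [8]). The block size, the compute_block kernel, and the overall program structure are illustrative assumptions, not code from the paper; the paper itself defines how OmpSs-2@Cluster decides where offloaded tasks run.

```c
#include <stddef.h>

#define N  1024   /* total elements (illustrative) */
#define BS 256    /* block size (illustrative) */

/* Placeholder kernel standing in for real per-block work. */
static void compute_block(double *block, size_t n) {
    for (size_t j = 0; j < n; j++)
        block[j] *= 2.0;
}

void process(double *data) {
    /* Outer tasks declare only *weak* dependencies: they create further
       work rather than touching the data, so the runtime can offload them
       and keep building the distributed dependency graph concurrently. */
    for (size_t i = 0; i < N; i += BS) {
        #pragma oss task weakinout(data[i;BS])
        {
            /* The nested task carries the strong dependency; the runtime
               tracks data location through it and transfers the block to
               whichever node executes the task. */
            #pragma oss task inout(data[i;BS])
            compute_block(&data[i], BS);
        }
    }
    #pragma oss taskwait  /* wait for all offloaded work to complete */
}
```

Because both levels name the same region, and the common address space makes the pointer &data[i] valid on every node, task creation can proceed without blocking, which is how graph construction overlaps execution.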

Notes

  1. Currently, nowait is available through a Nanos6 API call.

References

  1. Aguilar Mena, J., Shaaban, O., Beltran, V., Carpenter, P., Ayguade, E., Labarta Mancho, J.: Artifact and instructions to generate experimental results for the Euro-Par 2022 paper: “OmpSs-2@Cluster: Distributed memory execution of nested OpenMP-style tasks” (2022). https://doi.org/10.6084/m9.figshare.19960721

  2. Álvarez, D., Sala, K., Maroñas, M., Roca, A., Beltran, V.: Advanced synchronization techniques for task-based runtime systems, pp. 334–347. Association for Computing Machinery, New York, NY, USA (2021). https://doi.org/10.1145/3437801.3441601

  3. Augonnet, C., Aumage, O., Furmento, N., Namyst, R., Thibault, S.: StarPU-MPI: task programming over clusters of machines enhanced with accelerators. In: European MPI Users’ Group Meeting, pp. 298–299. Springer, Berlin, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33518-1_40

  4. Barcelona Supercomputing Center: MareNostrum 4 system architecture (2017). https://www.bsc.es/marenostrum/marenostrum/technical-information

  5. Barcelona Supercomputing Center: Mercurium (2021). https://pm.bsc.es/mcxx

  6. Barcelona Supercomputing Center: Nanos6 (2021). https://github.com/bsc-pm/nanos6

  7. Barcelona Supercomputing Center: OmpSs-2 releases (2021). https://github.com/bsc-pm/ompss-releases

  8. Barcelona Supercomputing Center: OmpSs-2 specification (2021). https://pm.bsc.es/ftp/ompss-2/doc/spec/

  9. Barcelona Supercomputing Center: OmpSs-2@Cluster releases (2022). https://github.com/bsc-pm/ompss-2-cluster-releases

  10. Bauer, M., Treichler, S., Slaughter, E., Aiken, A.: Legion: expressing locality and independence with logical regions. In: SC ’12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pp. 1–11 (2012). https://doi.org/10.1109/SC.2012.71

  11. Beckingsale, D.A., et al.: RAJA: portable performance for large-scale scientific applications. In: IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC) (2019). https://doi.org/10.1109/P3HPC49587.2019.00012

  12. Blumofe, R., Joerg, C., Kuszmaul, B., Leiserson, C., Randall, K., Zhou, Y.: Cilk: an efficient multithreaded runtime system. J. Parallel Distrib. Comput. 37(1), 55–69 (1996). https://doi.org/10.1006/jpdc.1996.0107

  13. Bueno, J., et al.: Productive programming of GPU clusters with OmpSs. In: IEEE 26th International Parallel and Distributed Processing Symposium (2012). https://doi.org/10.1109/IPDPS.2012.58

  14. Charles, P., et al.: X10: an object-oriented approach to non-uniform cluster computing. In: Proceedings of the 20th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, pp. 519–538. OOPSLA 2005. Association for Computing Machinery, New York, NY, USA (2005). https://doi.org/10.1145/1094811.1094852

  15. Deelman, E., et al.: Pegasus: a framework for mapping complex scientific workflows onto distributed systems. Sci. Program. 13(3), 219–237 (2005). https://doi.org/10.1155/2005/128026

  16. Faxén, K.F.: Wool user’s guide. Technical report, Swedish Institute of Computer Science (2009)


  17. Fürlinger, K., et al.: DASH: distributed data structures and parallel algorithms in a global address space. In: Software for Exascale Computing-SPPEXA 2016–2019, pp. 103–142. Springer International Publishing (2020). https://doi.org/10.1007/978-3-030-47956-5_6

  18. Hoque, R., Herault, T., Bosilca, G., Dongarra, J.: Dynamic task discovery in PaRSEC: a data-flow task-based runtime. In: Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, pp. 1–8 (2017). https://doi.org/10.1145/3148226.3148233

  19. Kaiser, H., Heller, T., Adelstein-Lelbach, B., Serio, A., Fey, D.: HPX: a task based programming model in a global address space. In: 8th International Conference on Partitioned Global Address Space Programming Models (2014). https://doi.org/10.13140/2.1.2635.5204

  20. Klinkenberg, J., Samfass, P., Bader, M., Terboven, C., Müller, M.: CHAMELEON: reactive load balancing for hybrid MPI+OpenMP task-parallel applications. J. Parallel Distrib. Comput. 138 (2019). https://doi.org/10.1016/j.jpdc.2019.12.005

  21. Leiserson, C.E.: The Cilk++ concurrency platform. J. Supercomput. 51(3), 244–257 (2010). https://doi.org/10.1007/s11227-010-0405-3


  22. Lordan, F., et al.: ServiceSs: an interoperable programming framework for the cloud. J. Grid Comput. 12(1), 67–91 (2013). https://doi.org/10.1007/s10723-013-9272-5

  23. OpenMP Architecture Review Board: OpenMP 4.0 complete specifications, July 2013


  24. Papakonstantinou, N., Zakkak, F.S., Pratikakis, P.: Hierarchical parallel dynamic dependence analysis for recursively task-parallel programs. In: 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 933–942 (2016). https://doi.org/10.1109/IPDPS.2016.53

  25. Parallel Programming Laboratory, Department of Computer Science, University of Illinois at Urbana-Champaign: Charm++ documentation. https://charm.readthedocs.io/en/latest/index.html

  26. Perez, J.M., Beltran, V., Labarta, J., Ayguadé, E.: Improving the integration of task nesting and dependencies in OpenMP. In: 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 809–818 (2017). https://doi.org/10.1109/IPDPS.2017.69

  27. Pérez, J., Badia, R.M., Labarta, J.: A dependency-aware task-based programming environment for multi-core architectures. In: Proceedings - IEEE International Conference on Cluster Computing, ICCC, pp. 142–151 (2008). https://doi.org/10.1109/CLUSTR.2008.4663765

  28. Rotaru, T., Rahn, M., Pfreundt, F.J.: MapReduce in GPI-Space. In: Euro-Par 2013: Parallel Processing Workshops, pp. 43–52. Springer, Berlin, Heidelberg (2014). https://doi.org/10.1007/978-3-642-54420-0_5

  29. Sala, K., Macià, S., Beltran, V.: Combining one-sided communications with task-based programming models. In: 2021 IEEE International Conference on Cluster Computing (CLUSTER), pp. 528–541 (2021). https://doi.org/10.1109/Cluster48925.2021.00024

  30. Sala, K., Teruel, X., Perez, J.M., Peña, A.J., Beltran, V., Labarta, J.: Integrating blocking and non-blocking MPI primitives with task-based programming models. Parallel Comput. 85, 153–166 (2019). https://doi.org/10.1016/j.parco.2018.12.008


  31. Sergent, M., Goudin, D., Thibault, S., Aumage, O.: Controlling the memory subscription of distributed applications with a task-based runtime system. In: 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 318–327 (2016). https://doi.org/10.1109/IPDPSW.2016.105

  32. Tillenius, M.: SuperGlue: a shared memory framework using data versioning for dependency-aware task-based parallelization. SIAM J. Sci. Comput. 37(6) (2015). https://doi.org/10.1137/140989716

  33. Tzenakis, G., Papatriantafyllou, A., Kesapides, J., Pratikakis, P., Vandierendonck, H., Nikolopoulos, D.S.: BDDT: Block-level dynamic dependence analysis for deterministic task-based parallelism. In: Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. PPoPP 2012, vol. 47, pp. 301–302. Association for Computing Machinery, New York, NY, USA (2012). https://doi.org/10.1145/2145816.2145864

  34. Virouleau, P., Broquedis, F., Gautier, T., Rastello, F.: Using data dependencies to improve task-based scheduling strategies on NUMA architectures. In: European Conference on Parallel Processing, pp. 531–544. Springer (2016). https://doi.org/10.1007/978-3-319-43659-3_39

  35. Zafari, A., Larsson, E., Tillenius, M.: DuctTeip: an efficient programming model for distributed task-based parallel computing. Parallel Comput. (2019). https://doi.org/10.1016/j.parco.2019.102582


Acknowledgements and Data Availability

The datasets and code generated and/or analyzed during the current study are available in the Figshare repository [1]. This research has received funding from the European Union’s Horizon 2020/EuroHPC research and innovation programme under grant agreements No. 955606 (DEEP-SEA) and No. 754337 (EuroEXA). It is supported by the Spanish State Research Agency - Ministry of Science and Innovation (contract PID2019-107255GB and Ramon y Cajal fellowship RYC2018-025628-I) and by the Generalitat de Catalunya (2017-SGR-1414).

Author information

Corresponding author: Jimmy Aguilar Mena.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Aguilar Mena, J., Shaaban, O., Beltran, V., Carpenter, P., Ayguade, E., Labarta Mancho, J. (2022). OmpSs-2@Cluster: Distributed Memory Execution of Nested OpenMP-style Tasks. In: Cano, J., Trinder, P. (eds) Euro-Par 2022: Parallel Processing. Euro-Par 2022. Lecture Notes in Computer Science, vol 13440. Springer, Cham. https://doi.org/10.1007/978-3-031-12597-3_20

  • DOI: https://doi.org/10.1007/978-3-031-12597-3_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-12596-6

  • Online ISBN: 978-3-031-12597-3

  • eBook Packages: Computer Science, Computer Science (R0)
