Abstract
This paper presents a new design of the Parallel Runtime Environment for Multi-computer Applications (PREMA). The framework provides large-scale applications with one-sided communication, remote method invocations, and a global namespace on top of transparent object migration, enabling implicit load balancing, scheduling, and latency hiding through an easy-to-use interface targeted at exascale-era platforms. The framework has been augmented with multithreading, separating communication and execution into different threads to provide asynchronous message reception and immediate computation execution. It allows for implicit parallel shared- and distributed-memory computations and guarantees correctness through an interface for assigning access privileges to parallel tasks, while monitoring the load of the system and performing migrations. Scheduling and load balancing are enhanced by introducing custom intra-node schedulers and the ability to perform concurrent migrations. The motivation for the development of the runtime system is to provide a dynamic runtime for adaptive and irregular parallel applications such as adaptive mesh refinement. Evaluating the system on such an application indicates an overall performance improvement of up to 50% compared to static load balancing, with an overhead of less than 1%, when using up to 190 computing nodes (i.e., 5600 cores); the improvement is achieved by maintaining a better workload distribution among the execution units. Evaluations with a communication-intensive application with static load balancing reveal that no significant overhead is added, despite the additional bookkeeping needed to monitor the load of each processing element.
Notes
In case of collisions, a list is used to keep the colliding elements in the same entry of the table.
References
Barker K, Chernikov A, Chrisochoides N, Pingali K (2004) A load balancing framework for adaptive and asynchronous applications. IEEE Trans Parallel Distrib Syst 15:183–192
Thomadakis P, Tsolakis C, Vogiatzis K, Kot A, Chrisochoides N (2018) Parallel software framework for large-scale parallel mesh generation and adaptation for CFD solvers. In: AIAA aviation forum 2018, Atlanta, Georgia, June 2018
von Eicken T, Culler DE, Goldstein SC, Schauser KE (1992) Active messages: a mechanism for integrated communication and computation. SIGARCH Comput Arch News 20:256–266
Krishnamurthy A, Culler DE, Dusseau A, Goldstein SC, Lumetta S, von Eicken T, Yelick K (1993) Parallel programming in Split-C. In: Proceedings of the 1993 ACM/IEEE conference on supercomputing, supercomputing ’93 (New York, NY, USA). Association for Computing Machinery, pp 262–273
Carlson WW, Draper JM, Culler D, Yelick K, Brooks E, Warren K, Livermore L (1999) Introduction to UPC and language specification. Tech. rep.
Slotnick J, Khodadoust A, Alonso J, Darmofal D, Gropp W, Lurie E, Mavriplis D (2014) CFD vision 2030 study: a path to revolutionary computational aerosciences. Tech. Rep. CR-2014-218178, Langley Research Center
Garner K, Thomadakis P, Kennedy T, Tsolakis C, Chrisochoides N (2019) On the end-user productivity of a pseudo-constrained parallel data refinement method for the advancing front local reconnection mesh generation software. In: AIAA aviation forum 2019. Dallas, Texas
Barker K, Chrisochoides N, Nave D, Dobellaere J, Pingali K (2002) Data movement and control substrate for parallel adaptive applications. Concurrency and computation: practice and experience, pp 77–105
Chrisochoides N, Barker K, Nave D, Hawblitzel C (2000) Mobile object layer: a runtime substrate for parallel adaptive and irregular computations. Adv Eng Softw 31:621–637
Fedorov A, Chrisochoides N (2004) Location management in object-based distributed computing. In: 2004 IEEE international conference on cluster computing (IEEE Cat. No.04EX935), pp 299–308
Nave D, Chrisochoides N, Chew L (2004) Guaranteed-quality parallel Delaunay refinement for restricted polyhedral domains. Comput Geom 28(2):191–215 (Special issue on the 18th annual symposium on computational geometry, SoCG 2002)
Balasubramaniam M, Barker K, Banicescu I, Chrisochoides N, Pabico J, Carino R (2004) A novel dynamic load balancing library for cluster computing. In: Third international symposium on parallel and distributed computing/third international workshop on algorithms, models and tools for parallel computing on heterogeneous Networks, pp 346–353
Blumofe RD, Leiserson CE (1999) Scheduling multithreaded computations by work stealing. J ACM 46:720–748
Metcalfe RM, Boggs DR (1976) Ethernet: distributed packet switching for local computer networks. Commun ACM 19:395–404
Dechev D, Pirkelbauer P, Stroustrup B (2010) Understanding and effectively preventing the ABA problem in descriptor-based lock-free designs. In: 2010 13th IEEE international symposium on object/component/service-oriented real-time distributed computing, pp 185–192
Chernikov A, Chrisochoides N (2006) Parallel guaranteed quality Delaunay uniform mesh refinement. SIAM J Sci Comput 28(5):1907–1926
Drakopoulos F, Tsolakis C, Chrisochoides NP (2019) Fine-grained speculative topological transformation scheme for local reconnection methods. AIAA J 57:4007–4018
Computational Infrastructure for Geodynamics: software. https://geodynamics.org/cig/software/sw4/. Accessed 21 Nov 2021
SW4lite (2019). https://github.com/geodynamics/sw4lite. Accessed 23 Jan 2021
Petersson N, Sjögreen B (2014) SW4 v1.1 [software]
Exascale project (2019). Accessed 23 Jan 2020
D S et al (2001) Tests of 3D elastodynamic codes: final report for lifelines project 1A01. Tech. rep., Pacific Earthquake Engineering Center
Carlson WW, Draper JM (1995) Distributed data access in AC. SIGPLAN Not. 30:39–47
Culler DE, Arpaci-Dusseau AC, Goldstein SC, Krishnamurthy A, Lumetta SS, von Eicken T, Yelick KA (1993) Parallel programming in Split-C. In: Supercomputing ’93 proceedings, pp 262–273
Numrich RW, Reid J (1998) Co-array Fortran for parallel programming. SIGPLAN Fortran Forum 17:1–31
Nieplocha J, Palmer B, Tipparaju V, Krishnan M, Trease H, Aprà E (2006) Advances, applications and performance of the global arrays shared memory programming toolkit. Int J High Perform Comput Appl 20:203–231
Yelick KA, Semenzato L, Pike G, Miyamoto C, Liblit B, Krishnamurthy A, Hilfinger PN, Graham SL, Gay D, Colella P, Aiken A (1998) Titanium: a high-performance Java dialect. Concurr Pract Exp 10:825–836
Chang C, Saltz J, Sussman A (1995) Chaos++: a runtime library for supporting distributed dynamic data structures. In: Parallel programming using C++
Chamberlain B, Callahan D, Zima H (2007) Parallel programmability and the Chapel language. Int J High Perform Comput Appl 21:291–312
Charles P, Grothoff C, Saraswat V, Donawa C, Kielstra A, Ebcioglu K, von Praun C, Sarkar V (2005) X10: an object-oriented approach to non-uniform cluster computing. SIGPLAN Not. 40:519–538
Kaiser H, Heller T, Adelstein-Lelbach B, Serio A, Fey D (2014) HPX: a task-based programming model in a global address space. In: Proceedings of the 8th international conference on partitioned global address space programming models, PGAS ’14, (New York, NY, USA), pp 6:1–6:11, ACM
Amini P (2020) Adaptive data migration in load-imbalanced HPC applications. PhD thesis, Louisiana State University and Agricultural and Mechanical College
Kale LV, Krishnan S (1993) Charm++: a portable concurrent object-oriented system based on C++. SIGPLAN Not. 28:91–108
Mattson TG, Cledat R, Cavé V, Sarkar V, Budimlić Z, Chatterjee S, Fryman J, Ganev I, Knauerhase R, Lee M, Meister B, Nickerson B, Pepperling N, Seshasayee B, Tasirlar S, Teller J, Vrvilo N (2016) The open community runtime: a runtime system for extreme scale computing. In: 2016 IEEE high performance extreme computing conference (HPEC), pp 1–7
Bauer M, Treichler S, Slaughter E, Aiken A (2012) Legion: expressing locality and independence with logical regions. In: Proceedings of the international conference on high performance computing, networking, storage and analysis, SC ’12, (Los Alamitos, CA, USA), pp 66:1–66:11, IEEE Computer Society Press
Kumar S, Dózsa G, Almási G, Heidelberger P, Chen D, Giampapa ME, Blocksome M, Faraj A, Parker J, Ratterman J, Smith BE, Archer CJ (2008) The deep computing messaging framework: generalized scalable message passing on the Blue Gene/P supercomputer. In: ICS ’08
Shah G, Nieplocha J, Mirza H, Kim C, Harrison R, Govindaraju R, Gildea K, DiNicola P, Bender C (1998) Performance and experience with LAPI: a new high-performance communication library for the IBM RS/6000 SP. In: Proceedings of the first merged international parallel processing symposium and symposium on parallel and distributed processing, pp 260–266
Bonachea D, Hargrove PH (2019) GASNet-EX: a high-performance, portable communication library for exascale. In: Hall M, Sundar H (eds) Languages and compilers for parallel computing. Springer, Cham, pp 138–158
Pope AL (1998) The CORBA reference guide: understanding the common object request broker architecture. Addison-Wesley Longman Publishing Co., Inc, USA
Waldo J (1998) Remote procedure calls and java remote method invocation. IEEE Concurr 6(3):5–7
Willcock JJ, Hoefler T, Edmonds NG, Lumsdaine A (2010) AM++: a generalized active message framework. In: Proceedings of the 19th international conference on parallel architectures and compilation techniques, PACT ’10, (New York, NY, USA). Association for Computing Machinery, pp 401–410
Thomas N, Saunders S, Smith T, Tanase G, Rauchwerger L (2006) ARMI: a high level communication library for STAPL. Parallel Process Lett 16:261–280
Seo S, Amer A, Balaji P, Bordage C, Bosilca G, Brooks A, Carns P, Castelló A, Genet D, Herault T, Iwasaki S, Jindal P, Kalé LV, Krishnamoorthy S, Lifflander J, Lu H, Meneses E, Snir M, Sun Y, Taura K, Beckman P (2018) Argobots: a lightweight low-level threading and tasking framework. IEEE Trans Parallel Distrib Syst 29(3):512–526
Kot A, Chernikov A, Chrisochoides N (2011) The evaluation of an effective out-of-core run-time system in the context of parallel mesh generation. In: IEEE international parallel and distributed processing symposium, pp 164–175
Acknowledgements
This work is funded in part by the Dominion Fellowship, the Richard T. Cheng Endowment at Old Dominion University and NSF Grant no: CNS-1828593.
A. Appendix
This appendix presents the (simplified) implementation of two load balancing strategies, Master–Worker and Diffusion, built on PREMA’s scheduler API. The two strategies have been used to scale different irregular applications ([7], Sect. 5.3.2) while significantly reducing code complexity (by removing load balancing-related code) and line count (e.g., 1200 vs 2500 LOC in the first application) compared to the respective MPI implementations.
1.1 A.1 Master–worker
Figure 12 presents the simplified master–worker implementation. The derived class assigns a single node as the master and defines a load threshold under which a worker node is considered underloaded (line 3). Each node keeps a custom map (provided by PREMA) that holds mobile objects along with their workload and tracks the overall node workload. When a worker finds its load below the threshold, it sends a remote request to the master for a new mobile object migration (lines 5–7). If the master holds enough load, it picks a mobile object, packs it, and sends it to the requesting worker (lines 27–30). Otherwise, it asks the worker to wait and pushes its rank to a list of waiting workers (lines 33–34). On receiving the master’s reply, the worker unpacks and installs the packed object on the local node, which updates PREMA about the migration (lines 38–41). If there is no mobile object to unpack, the worker sets a flag indicating that it should wait for new workload from the master when it becomes available (line 44). In this simplified case, we use a simple list to maintain handler invocation requests and support the push()/pop() operations; a more sophisticated implementation could use work pools per thread, per mobile object, or a combination of the two. Method notify() keeps the mobile object-to-load map and the node workload up to date and is called each time the node workload changes.
1.2 A.2 Diffusion
Figure 13 presents the simplified diffusive scheme implementation. In this scheme, each node is assigned a “neighborhood” of other nodes from which it can request workload. In each new load balancing phase, the underloaded node tries to steal from the node with the largest workload in its neighborhood. If no neighbor has enough workload, a new neighborhood is assigned for the next load balancing phase. In this implementation, dist_balance() checks whether the node is underloaded and initiates a new load balancing phase by requesting the workload levels of its neighborhood. Once the underloaded node receives all the responses, it chooses the neighbor with the highest load and requests a mobile object migration, or assigns a new neighborhood if not enough workload exists (lines 25–36). The receiver of a migration request picks its mobile object with the largest workload and, if its remaining workload is sufficient, packs and sends the object to the underloaded node. Otherwise, it refuses to migrate any work (lines 39–44). Depending on this response, the requester either unpacks and installs the received mobile object or replaces the neighbor in the neighborhood set and prepares for a new load balancing phase.
Cite this article
Thomadakis, P., Tsolakis, C. & Chrisochoides, N. Multithreaded runtime framework for parallel and adaptive applications. Engineering with Computers 38, 4675–4695 (2022). https://doi.org/10.1007/s00366-022-01713-7