
An integrated fine-grain runtime system for MPI


Abstract

Fine-grain MPI (FG-MPI) extends the MPI execution model to allow interleaved execution of multiple concurrent MPI processes inside an OS process. It provides a runtime that is integrated into the MPICH2 middleware and uses light-weight coroutines to implement an MPI-aware scheduler. In this paper we describe the FG-MPI runtime system and discuss the main design issues in its implementation. FG-MPI enables the expression of function-level parallelism, which, together with the runtime scheduler, can be used to simplify MPI programming and achieve performance without adding complexity to the program. As an example, we use FG-MPI to re-structure a typical use of non-blocking communication and show that the integrated scheduler relieves the programmer from scheduling computation and communication inside the application, moving this performance concern out of the program specification and into the runtime.
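As a rough illustration of the restructuring described above, the following is a minimal sketch, not code from the paper: it contrasts a hand-scheduled non-blocking pipeline stage (the programmer slices local work and polls the outstanding request) with the plain blocking version that, under FG-MPI, the integrated scheduler can overlap by switching to a co-located process at each blocking call. The code is standard MPI C; the pipeline layout, the compute() function, and the buffer sizes are illustrative assumptions, and any FG-MPI-specific launch or binding boilerplate is omitted.

```c
/* Sketch only: two styles of one pipeline stage.  Standard MPI C;
 * names and sizes are illustrative, FG-MPI boilerplate omitted. */
#include <mpi.h>

#define CHUNK 1024

/* Placeholder computation on a buffer (illustrative only). */
static void compute(double *buf, int n)
{
    for (int i = 0; i < n; ++i)
        buf[i] = 0.5 * buf[i] + 1.0;
}

/* Style A: the application schedules the overlap itself, slicing
 * independent local work and polling the receive between slices. */
static void stage_hand_scheduled(double *buf, double *local, int prev, int next)
{
    MPI_Request req;
    int done = 0;

    MPI_Irecv(buf, CHUNK, MPI_DOUBLE, prev, 0, MPI_COMM_WORLD, &req);
    for (int s = 0; s < 8; ++s) {
        compute(local + s * (CHUNK / 8), CHUNK / 8);   /* one slice of work */
        if (!done)
            MPI_Test(&req, &done, MPI_STATUS_IGNORE);  /* poll for progress */
    }
    MPI_Wait(&req, MPI_STATUS_IGNORE);                 /* data has arrived  */
    compute(buf, CHUNK);                               /* work on the data  */
    MPI_Send(buf, CHUNK, MPI_DOUBLE, next, 0, MPI_COMM_WORLD);
}

/* Style B: plain blocking calls.  Under FG-MPI a blocking call yields to a
 * co-located fine-grain process, so the overlap is produced by the runtime
 * scheduler rather than by hand-written slicing and polling. */
static void stage_blocking(double *buf, double *local, int prev, int next)
{
    MPI_Recv(buf, CHUNK, MPI_DOUBLE, prev, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    compute(local, CHUNK);                             /* independent work  */
    compute(buf, CHUNK);                               /* work on the data  */
    MPI_Send(buf, CHUNK, MPI_DOUBLE, next, 0, MPI_COMM_WORLD);
}

int main(int argc, char **argv)
{
    int rank, size;
    double buf[CHUNK] = {0}, local[CHUNK] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2) {
        if (rank == 0) {                         /* pipeline source        */
            compute(buf, CHUNK);
            MPI_Send(buf, CHUNK, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == size - 1) {           /* pipeline sink          */
            MPI_Recv(buf, CHUNK, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (argc > 1) {                   /* middle stage, style A  */
            stage_hand_scheduled(buf, local, rank - 1, rank + 1);
        } else {                                 /* middle stage, style B  */
            stage_blocking(buf, local, rank - 1, rank + 1);
        }
    }

    MPI_Finalize();
    return 0;
}
```

With several fine-grain processes mapped to each OS process, a blocking call in one process simply becomes a scheduling point for another, which is how the overlap of style A can be recovered without the explicit slicing and polling in the application.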


Notes

  1. We will be using the terms “node” and “machine” interchangeably in this paper to refer to a single computational node with multiple processor cores, operating under a single operating system.

  2. MPI processes sharing the same address space are referred to as co-located processes.

  3. We do not support MPI dynamic process management functionality.


Author information

Correspondence to Humaira Kamal.


Cite this article

Kamal, H., Wagner, A. An integrated fine-grain runtime system for MPI. Computing 96, 293–309 (2014). https://doi.org/10.1007/s00607-013-0329-x