
Improved MPI collectives for MPI processes in shared address spaces


Abstract

As the number of cores per node keeps growing, it becomes increasingly important for MPI to leverage shared memory for intranode communication. This paper investigates the design and optimization of MPI collectives for clusters of NUMA nodes. We develop performance models for collective communication using shared memory, and we demonstrate several algorithms for various collectives. Experiments are conducted on both Xeon X5650 and Opteron 6100 InfiniBand clusters. The measurements agree with the models and indicate that different algorithms dominate for short vectors and for long vectors. We compare our shared-memory allreduce with several MPI implementations (Open MPI, MPICH2, and MVAPICH2) that utilize system shared memory to facilitate interprocess communication. On a 16-node Xeon cluster and an 8-node Opteron cluster, our implementation achieves geometric-mean speedups of 2.3X and 2.1X, respectively, over the best of these MPI implementations. Our techniques enable an efficient implementation of collective operations on future multi- and manycore systems.
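The full text is subscription-only, but the hierarchy the abstract describes (reduce within each node through shared memory, then combine partial results across nodes over the network) can be sketched with standard MPI-3 facilities. The code below is a minimal illustration under assumptions, not the paper's implementation: it uses MPI-3 shared-memory windows rather than the authors' shared-address-space processes, performs a flat (non-NUMA-aware) intra-node reduction, and omits the short/long-vector algorithm selection that the paper evaluates; the function name node_aware_allreduce_sum is hypothetical.

```c
/* Minimal sketch of a node-aware allreduce (sum of doubles).
 * Assumptions: MPI-3, called collectively by all ranks, count > 0.
 * This is NOT the paper's NUMA-aware algorithm; it only shows the
 * node-hierarchical structure described in the abstract. */
#include <mpi.h>
#include <string.h>

static void node_aware_allreduce_sum(const double *sendbuf, double *recvbuf,
                                     int count, MPI_Comm comm)
{
    /* Ranks that share physical memory form one node communicator. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL,
                        &node_comm);
    int node_rank, node_size;
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &node_size);

    /* One leader per node participates in the inter-node phase. */
    MPI_Comm leader_comm = MPI_COMM_NULL;
    MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED, 0, &leader_comm);

    /* Each rank exposes its contribution in a node-wide shared window. */
    double *my_seg;
    MPI_Win win;
    MPI_Win_allocate_shared((MPI_Aint)count * sizeof(double), sizeof(double),
                            MPI_INFO_NULL, node_comm, &my_seg, &win);
    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);

    memcpy(my_seg, sendbuf, (size_t)count * sizeof(double));
    MPI_Win_sync(win);
    MPI_Barrier(node_comm);            /* all contributions are visible */
    MPI_Win_sync(win);

    if (node_rank == 0) {
        /* Flat intra-node reduction directly out of shared memory. */
        for (int r = 1; r < node_size; r++) {
            MPI_Aint seg_bytes; int disp; double *seg;
            MPI_Win_shared_query(win, r, &seg_bytes, &disp, &seg);
            for (int i = 0; i < count; i++)
                my_seg[i] += seg[i];
        }
        /* Combine the per-node partial sums across nodes. */
        MPI_Allreduce(MPI_IN_PLACE, my_seg, count, MPI_DOUBLE, MPI_SUM,
                      leader_comm);
    }
    MPI_Win_sync(win);
    MPI_Barrier(node_comm);            /* final result is visible node-wide */
    MPI_Win_sync(win);

    /* Every rank reads the result from the leader's shared segment. */
    double *result = my_seg;
    if (node_rank != 0) {
        MPI_Aint seg_bytes; int disp;
        MPI_Win_shared_query(win, 0, &seg_bytes, &disp, &result);
    }
    memcpy(recvbuf, result, (size_t)count * sizeof(double));

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    if (leader_comm != MPI_COMM_NULL) MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
}
```

In a production implementation, the flat intra-node loop would be replaced by the NUMA-aware, possibly pipelined reductions the paper models, and for long vectors the single-leader inter-node allreduce would typically give way to a reduce-scatter/allgather scheme; the abstract's observation that different algorithms dominate for short and long vectors reflects exactly this kind of trade-off.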





Acknowledgments

This work is supported in part by the DOE Office of Science, Advanced Scientific Computing Research, under Award numbers DE-FC02-10ER26011 (program manager Lucy Nowell) and DE-AC02-06CH11357. Li is supported in part by the National Key Basic Research and Development Program of China under Nos. 2013CB329605 and 2013CB329606, and by the Key Project of the National 25th Year Research Program of China under No. 2011BAK08B04.

Author information


Corresponding author

Correspondence to Shigang Li.

Additional information

Shigang Li is currently a visiting graduate student at the Department of Computer Science, University of Illinois at Urbana-Champaign.



Cite this article

Li, S., Hoefler, T., Hu, C. et al. Improved MPI collectives for MPI processes in shared address spaces. Cluster Comput 17, 1139–1155 (2014). https://doi.org/10.1007/s10586-014-0361-4

