
Improved MPI collectives for MPI processes in shared address spaces


Abstract

As the number of cores per node keeps growing, it becomes increasingly important for MPI to leverage shared memory for intranode communication. This paper investigates the design and optimization of MPI collectives for clusters of NUMA nodes. We develop performance models for collective communication using shared memory, and we demonstrate several algorithms for various collectives. Experiments are conducted on both Xeon X5650 and Opteron 6100 InfiniBand clusters. The measurements agree with the models and indicate that different algorithms dominate for short vectors and for long vectors. We compare our shared-memory allreduce with several MPI implementations (Open MPI, MPICH2, and MVAPICH2) that utilize system shared memory to facilitate interprocess communication. On a 16-node Xeon cluster and an 8-node Opteron cluster, our implementation achieves geometric-mean speedups of 2.3X and 2.1X, respectively, over the best of these MPI implementations. Our techniques enable an efficient implementation of collective operations on future multi- and manycore systems.
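The full text is subscription-only, but the hierarchy the abstract describes (reduce within each node through shared memory, then combine partial results across nodes over the network) can be sketched with standard MPI-3 facilities. The code below is a minimal illustration under assumptions, not the paper's implementation: it uses MPI-3 shared-memory windows rather than the authors' shared-address-space processes, performs a flat (non-NUMA-aware) intra-node reduction, and omits the short/long-vector algorithm selection that the paper evaluates; the function name node_aware_allreduce_sum is hypothetical.

```c
/* Minimal sketch of a node-aware allreduce (sum of doubles).
 * Assumptions: MPI-3, called collectively by all ranks, count > 0.
 * This is NOT the paper's NUMA-aware algorithm; it only shows the
 * node-hierarchical structure described in the abstract. */
#include <mpi.h>
#include <string.h>

static void node_aware_allreduce_sum(const double *sendbuf, double *recvbuf,
                                     int count, MPI_Comm comm)
{
    /* Ranks that share physical memory form one node communicator. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL,
                        &node_comm);
    int node_rank, node_size;
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &node_size);

    /* One leader per node participates in the inter-node phase. */
    MPI_Comm leader_comm = MPI_COMM_NULL;
    MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED, 0, &leader_comm);

    /* Each rank exposes its contribution in a node-wide shared window. */
    double *my_seg;
    MPI_Win win;
    MPI_Win_allocate_shared((MPI_Aint)count * sizeof(double), sizeof(double),
                            MPI_INFO_NULL, node_comm, &my_seg, &win);
    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);

    memcpy(my_seg, sendbuf, (size_t)count * sizeof(double));
    MPI_Win_sync(win);
    MPI_Barrier(node_comm);            /* all contributions are visible */
    MPI_Win_sync(win);

    if (node_rank == 0) {
        /* Flat intra-node reduction directly out of shared memory. */
        for (int r = 1; r < node_size; r++) {
            MPI_Aint seg_bytes; int disp; double *seg;
            MPI_Win_shared_query(win, r, &seg_bytes, &disp, &seg);
            for (int i = 0; i < count; i++)
                my_seg[i] += seg[i];
        }
        /* Combine the per-node partial sums across nodes. */
        MPI_Allreduce(MPI_IN_PLACE, my_seg, count, MPI_DOUBLE, MPI_SUM,
                      leader_comm);
    }
    MPI_Win_sync(win);
    MPI_Barrier(node_comm);            /* final result is visible node-wide */
    MPI_Win_sync(win);

    /* Every rank reads the result from the leader's shared segment. */
    double *result = my_seg;
    if (node_rank != 0) {
        MPI_Aint seg_bytes; int disp;
        MPI_Win_shared_query(win, 0, &seg_bytes, &disp, &result);
    }
    memcpy(recvbuf, result, (size_t)count * sizeof(double));

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    if (leader_comm != MPI_COMM_NULL) MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
}
```

In a production implementation, the flat intra-node loop would be replaced by the NUMA-aware, possibly pipelined reductions the paper models, and for long vectors the single-leader inter-node allreduce would typically give way to a reduce-scatter/allgather scheme; the abstract's observation that different algorithms dominate for short and long vectors reflects exactly this kind of trade-off.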





Acknowledgments

This work is supported in part by the DOE Office of Science, Advanced Scientific Computing Research, under Award numbers DE-FC02-10ER26011 (program manager Lucy Nowell) and DE-AC02-06CH11357. Li is supported in part by the National Key Basic Research and Development Program of China under Nos. 2013CB329605 and 2013CB329606, and by the Key Project of the National 25th Year Research Program of China under No. 2011BAK08B04.

Author information


Corresponding author

Correspondence to Shigang Li.

Additional information

Shigang Li is currently a visiting graduate student at the Department of Computer Science, University of Illinois at Urbana-Champaign.



Cite this article

Li, S., Hoefler, T., Hu, C. et al. Improved MPI collectives for MPI processes in shared address spaces. Cluster Comput 17, 1139–1155 (2014). https://doi.org/10.1007/s10586-014-0361-4

