Performance analysis and optimization of MPI collective operations on multi-core clusters

The Journal of Supercomputing

Abstract

The memory hierarchy of multi-core clusters has two dimensions: a vertical memory hierarchy and a horizontal memory hierarchy. This paper proposes a new parallel computation model that abstracts the memory hierarchy of multi-core clusters uniformly at both the vertical and horizontal levels. Experimental results show that the new model predicts communication costs for message passing on multi-core clusters more accurately than previous models, which incorporate only the vertical memory hierarchy. The new model provides the theoretical underpinning for the optimal design of MPI collective operations. Targeting the horizontal memory hierarchy, our methodology for optimizing collective operations on multi-core clusters focuses on hierarchical virtual topology and cache-aware intra-node communication, both incorporated into the existing collective algorithms in MPICH2. As a case study, a multi-core aware broadcast algorithm has been implemented and evaluated. The performance evaluation shows that this methodology for optimizing collective operations on multi-core clusters is efficient.
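To illustrate why a hierarchical virtual topology can matter for broadcast, the following is a minimal sketch and not the authors' implementation. It only counts messages in a simulated binomial-tree broadcast (a tree shape commonly used for short-message broadcast) and compares a flat tree over all ranks with a two-level scheme that first broadcasts among one leader per node and then inside each node. The function names, the round-robin rank placement, and the choice of rank 0 as root are assumptions made for the example; no timing or cache behaviour is modelled.

```python
def binomial_edges(ranks):
    """Return (sender, receiver) pairs of a binomial-tree broadcast
    rooted at ranks[0]. The parent of relative rank i is i with its
    lowest set bit cleared."""
    edges = []
    for i in range(1, len(ranks)):
        parent = i & (i - 1)
        edges.append((ranks[parent], ranks[i]))
    return edges


def flat_broadcast(n_ranks):
    """Binomial tree over all ranks, ignoring node boundaries."""
    return binomial_edges(list(range(n_ranks)))


def hierarchical_broadcast(n_ranks, node_of):
    """Two-level broadcast (illustrative assumption): a binomial tree
    among one leader per node, then a binomial tree inside each node.
    Rank 0 is the root and therefore the leader of its own node."""
    groups = {}
    for r in range(n_ranks):
        groups.setdefault(node_of(r), []).append(r)
    leaders = [members[0] for members in groups.values()]
    edges = binomial_edges(leaders)        # inter-node phase
    for members in groups.values():
        edges += binomial_edges(members)   # intra-node phase
    return edges


def count_inter_node(edges, node_of):
    """Number of messages whose sender and receiver sit on different nodes."""
    return sum(1 for s, d in edges if node_of(s) != node_of(d))


if __name__ == "__main__":
    n_ranks, n_nodes = 16, 4

    def cyclic(r):
        # Hypothetical round-robin placement of ranks on nodes.
        return r % n_nodes

    flat = flat_broadcast(n_ranks)
    hier = hierarchical_broadcast(n_ranks, cyclic)
    print("flat inter-node messages:", count_inter_node(flat, cyclic))
    print("hierarchical inter-node messages:", count_inter_node(hier, cyclic))
```

Under these assumptions (16 ranks placed round-robin on 4 nodes), both schemes send 15 messages in total, but the flat tree sends 12 of them across nodes while the two-level scheme sends only 3. With a block placement of ranks the flat binomial tree already keeps most traffic inside nodes, so the benefit depends on how ranks are mapped to cores.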

Author information

Corresponding author

Correspondence to Bibo Tu.

About this article

Cite this article

Tu, B., Fan, J., Zhan, J. et al. Performance analysis and optimization of MPI collective operations on multi-core clusters. J Supercomput 60, 141–162 (2012). https://doi.org/10.1007/s11227-009-0296-3

  • DOI: https://doi.org/10.1007/s11227-009-0296-3
