Abstract
The process arrival pattern, that is, the timing with which different processes arrive at an MPI collective operation, can significantly affect the performance of the operation. In this work, we characterize the process arrival patterns in a set of MPI programs on two common cluster platforms, use a micro-benchmark to study the process arrival patterns in MPI programs with balanced loads, and investigate the impact of different process arrival patterns on collective algorithms. Our results show that (1) the differences between the times when different processes arrive at a collective operation are usually large enough to affect performance; (2) application developers in general cannot effectively control the process arrival patterns in their MPI programs in the cluster environment: balancing loads at the application level does not balance the process arrival patterns; and (3) the performance of collective communication algorithms is sensitive to process arrival patterns. These results indicate that the process arrival pattern is an important factor that must be taken into consideration when developing and optimizing MPI collective routines. We propose a scheme that achieves high performance under different process arrival patterns, and demonstrate that, by explicitly considering the process arrival pattern, MPI collective routines more efficient than the current ones can be obtained.
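The arrival-time differences the abstract refers to can be quantified as an imbalance factor: the spread of per-process arrival times at a collective call, normalized by the time to transmit one message of the size used by the operation. The sketch below illustrates this computation; the timestamps and per-message cost are illustrative assumptions, not measurements from the paper.

```python
# Sketch: quantifying a process arrival pattern at a collective call.
# arrival_times[i] is the time (in microseconds) at which rank i reached
# the collective; per_msg_time is the time to transmit one message of the
# size used by the operation. All values here are illustrative.

def imbalance_factor(arrival_times, per_msg_time):
    """Worst-case spread of arrival times, in units of one message time."""
    spread = max(arrival_times) - min(arrival_times)
    return spread / per_msg_time

# Four ranks arriving over a 6000-microsecond window; one message
# costs 1000 microseconds, so the arrival pattern spans 6 message times.
arrivals = [0, 4000, 1000, 6000]
print(imbalance_factor(arrivals, 1000))  # → 6.0
```

An imbalance factor well above 1 means some processes wait at the collective for many message times, which is why an algorithm tuned under the assumption of simultaneous arrival can lose its advantage in practice.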
Cite this article
Faraj, A., Patarasuk, P. & Yuan, X. A Study of Process Arrival Patterns for MPI Collective Operations. Int J Parallel Prog 36, 543–570 (2008). https://doi.org/10.1007/s10766-008-0070-9