ABSTRACT
The overlap of computation and communication has long been considered a significant performance benefit for applications. Similarly, the ability of MPI to make independent progress (that is, to advance outstanding communication operations while the application is not in the MPI library) is also believed to yield performance benefits. Offloading the work required to support overlap and independent progress to an intelligent network interface is thought to be an ideal solution, but the benefits of this approach have been poorly studied at the application level. Such analysis is complicated by the fact that most MPI implementations do not sufficiently support overlap or independent progress. Recent work has demonstrated a quantifiable advantage for an MPI implementation that uses offload to provide overlap and independent progress. This paper extends that work by further isolating the source of the performance advantage: offload, overlap, or independent progress.
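The mechanism the abstract describes, overlapping computation with communication through non-blocking MPI calls, can be sketched as follows. This is an illustrative example, not code from the paper: the buffer size, tag, and rank-pairing scheme are arbitrary choices, and whether the compute loop truly overlaps the transfer depends on the MPI implementation's support for independent progress (e.g., NIC offload versus host-driven progress).

```c
/* Sketch: post non-blocking communication, compute, then complete.
 * With independent progress (e.g., NIC offload), the transfer can
 * proceed during the compute loop; with host-driven progress, it
 * may stall until the next MPI call. Illustrative only. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    enum { N = 1 << 20 };               /* arbitrary message size */
    static double sendbuf[N], recvbuf[N];
    int peer = rank ^ 1;                /* pair up neighboring ranks */
    MPI_Request reqs[2];

    if (peer < size) {
        /* Post the communication early ... */
        MPI_Irecv(recvbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(sendbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

        /* ... then compute while the messages are (ideally) in flight. */
        double sum = 0.0;
        for (int i = 0; i < N; i++)
            sum += sendbuf[i] * 0.5;

        /* Complete the transfer before reusing either buffer. */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        if (rank == 0)
            printf("overlap region done, checksum %g\n", sum);
    }

    MPI_Finalize();
    return 0;
}
```

Had a blocking MPI_Send/MPI_Recv pair been used instead, the compute loop could not begin until the transfer finished; the non-blocking form is what makes overlap possible, and offload is what lets the transfer progress without further MPI calls.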
- N. J. Boden, D. Cohen, R. E. Felderman, A. E. Kulawik, C. L. Seitz, J. N. Seizovic, and W.-K. Su. Myrinet: A gigabit-per-second local area network. IEEE Micro, 15(1):29--36, February 1995.
- R. Brightwell. A new MPI implementation for Cray SHMEM. Technical report, Sandia National Laboratories.
- R. Brightwell and K. Underwood. Evaluation of an eager protocol optimization for MPI. In Proceedings of EuroPVM/MPI, September 2003.
- R. Brightwell and K. D. Underwood. An analysis of NIC resource usage for offloading MPI. In Proceedings of the 2004 Workshop on Communication Architecture for Clusters, Santa Fe, NM, April 2004.
- R. Brightwell and K. D. Underwood. An initial analysis of the impact of overlap and independent progress for MPI. Submitted for publication, 2004.
- R. B. Brightwell and P. L. Shuler. Design and implementation of MPI on Puma portals. In Proceedings of the Second MPI Developer's Conference, pages 18--25, July 1996.
- Cray Research, Inc. SHMEM Technical Note for C, SG-2516 2.3, October 1994.
- InfiniBand Trade Association. http://www.infinibandta.org, 1999.
- J. Liu, B. Chandrasekaran, J. Wu, W. Jiang, S. Kini, W. Yu, D. Buntinas, P. Wyckoff, and D. K. Panda. Performance comparison of MPI implementations over InfiniBand, Myrinet and Quadrics. In The International Conference for High Performance Computing and Communications (SC2003), November 2003.
- A. B. Maccabe, R. Riesen, and D. W. van Dresser. Dynamic processor modes in Puma. Bulletin of the Technical Committee on Operating Systems and Application Environments (TCOS), 8(2):4--12, 1996.
- F. Petrini, W.-C. Feng, A. Hoisie, S. Coll, and E. Frachtenberg. The Quadrics network: High-performance clustering technology. IEEE Micro, 22(1):46--57, January/February 2002.
- F. Petrini, D. J. Kerbyson, and S. Pakin. The case of the missing supercomputer performance: Identifying and eliminating the performance variability on the ASCI Q machine. In Proceedings of the 2003 Conference on High Performance Networking and Computing, November 2003.
- L. Shuler, C. Jong, R. Riesen, D. van Dresser, A. B. Maccabe, L. A. Fisk, and T. M. Stallcup. The Puma operating system for massively parallel computers. In Proceedings of the 1995 Intel Supercomputer User's Group Conference, 1995.
- T. G. Mattson, D. Scott, and S. R. Wheat. A TeraFLOPS supercomputer in 1996: The ASCI TFLOP system. In Proceedings of the 1996 International Parallel Processing Symposium, 1996.
- K. D. Underwood and R. Brightwell. The impact of MPI queue usage on MPI latency. Submitted for publication, 2004.
- J. S. Vetter and F. Mueller. Communication characteristics of large-scale scientific applications for contemporary cluster architectures. In 16th International Parallel and Distributed Processing Symposium (IPDPS'02), pages 27--29, April 2002.
- F. Wong, R. Martin, R. Arpaci-Dusseau, and D. E. Culler. Architectural requirements and scalability of the NAS parallel benchmarks. In Proceedings of the SC99 Conference on High Performance Networking and Computing, November 1999.
- An analysis of the impact of MPI overlap and independent progress