Abstract
Overlapping computation with communication is a well-understood technique for hiding latency, and it can yield significant performance gains for applications on high-end computers. In this paper, we propose an instrumentation framework for message-passing systems that characterizes the degree to which communication overlaps with computation during the execution of parallel applications. Because precise time-stamps for the pertinent communication events cannot be obtained, the framework instead generates minimum and maximum bounds on the achieved overlap. These overlap measures can help application developers and system designers investigate scalability issues. The approach has been used to instrument two MPI implementations as well as the ARMCI system. The implementation resides entirely within the communication library and therefore integrates well with existing approaches that operate outside the library. We demonstrate the utility of the framework by analyzing communication-computation overlap for micro-benchmarks and the NAS benchmarks, and we use the resulting insights to modify the NAS SP benchmark, which improves its overlap.
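The idea of bounding overlap, rather than measuring it exactly, can be illustrated with a minimal sketch. It is not the paper's algorithm: the names `TransferWindow` and `overlap_bounds` are hypothetical, and the sketch assumes the library can record three quantities for each non-blocking transfer. These are an estimated transfer time, the time spent in application code between initiation and detected completion, and the time spent inside the library in that same window.

```python
from dataclasses import dataclass


@dataclass
class TransferWindow:
    """Hypothetical per-transfer record; all times in the same unit."""
    xfer_time: float     # estimated time the data transfer itself takes
    compute_time: float  # time in application code between initiation
                         # and the library call that detects completion
    library_time: float  # time inside the communication library in
                         # that same window


def overlap_bounds(w: TransferWindow) -> tuple[float, float]:
    """Return (min_overlap, max_overlap) for one transfer.

    Upper bound: assume the transfer ran during computation whenever
    it could, so overlap is limited by the shorter of the two.
    Lower bound: assume the transfer ran inside the library whenever
    it could, so only the remainder must have overlapped computation.
    """
    max_overlap = min(w.xfer_time, w.compute_time)
    min_overlap = max(0.0, w.xfer_time - w.library_time)
    # The transfer time is only an estimate, so keep the bounds ordered.
    min_overlap = min(min_overlap, max_overlap)
    return min_overlap, max_overlap


if __name__ == "__main__":
    # Transfer of 10 units, 8 units of computation, 5 units in the library.
    print(overlap_bounds(TransferWindow(10.0, 8.0, 5.0)))
```

Under these assumptions, a blocking transfer (no computation in the window) gets both bounds equal to zero, while a short transfer issued well before a long computation gets bounds close to its full transfer time.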
Cite this article
Shet, A.G., Sadayappan, P., Bernholdt, D.E. et al. A framework for characterizing overlap of communication and computation in parallel applications. Cluster Comput 11, 75–90 (2008). https://doi.org/10.1007/s10586-007-0046-3
DOI: https://doi.org/10.1007/s10586-007-0046-3