ABSTRACT
Several existing compiler transformations can help improve communication-computation overlap in MPI applications. However, traditional compilers treat calls to the MPI library as a black box with unknown side effects and thus miss potential optimizations. This paper's contributions enable the development of an MPI-aware optimizing compiler that exploits knowledge of MPI call effects to increase communication-computation overlap. We formulate a set of data-flow equations and rules that describe the side effects of key MPI functions, so that an MPI-aware compiler can automatically assess the safety of transformations. After categorizing existing compiler transformations by their effect on the application code, we present an optimization algorithm that specifies when and how to apply these transformations to achieve improved communication-computation overlap. By manually applying the optimization algorithm to kernels extracted from HYCOM and the NAS benchmarks, we show that even when transforming these highly optimized codes, execution time can be decreased by an average of over 30%.
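The kind of transformation the abstract describes can be sketched as a before/after pair: a blocking send is split into a nonblocking initiation and a completion wait, and computation that the data-flow analysis proves independent of the message buffer is placed between them. This is an illustrative sketch only; the function and variable names (`exchange_blocking`, `compute`, `halo`, `interior`) are hypothetical and not taken from the paper, though the MPI calls themselves are standard.

```c
#include <mpi.h>

/* Hypothetical local computation, assumed not to read or write `halo`. */
void compute(double *interior, int n);

/* Before: the send blocks, serializing communication and computation. */
void exchange_blocking(double *halo, double *interior, int n, int peer)
{
    MPI_Send(halo, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
    compute(interior, n);        /* runs only after the send completes */
}

/* After the overlap transformation: MPI_Send is decomposed into
 * MPI_Isend + MPI_Wait, and the independent computation is hoisted
 * between them so it hides the communication latency. Safety hinges
 * on the analysis showing compute() never touches the send buffer. */
void exchange_overlapped(double *halo, double *interior, int n, int peer)
{
    MPI_Request req;
    MPI_Isend(halo, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req);
    compute(interior, n);        /* overlaps with the in-flight send */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}
```

The same pattern applies on the receive side (`MPI_Recv` split into `MPI_Irecv` + `MPI_Wait`), where the constraint is that the hoisted computation must not read the receive buffer before the wait completes.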