
Accurately Selecting Block Size at Runtime in Pipelined Parallel Programs

Abstract

Loops that contain cross-processor data dependencies, known as DOACROSS loops, are often found in scientific programs. Efficiently parallelizing such loops is important yet nontrivial. One useful parallelization technique for DOACROSS loops is pipelining, where each processor (node) performs its computation in blocks; after each block, it sends data to the next node in the pipeline. The amount of computation performed before sending a message is called the block size; its choice, although difficult to make statically, is important for efficient execution. This paper describes a flexible runtime approach to choosing the block size. Rather than rely on static estimation of the workload, our system takes measurements during the first two iterations of a program and then uses the results to build an execution model and choose an appropriate block size which, unlike a static choice, may be nonuniform. To increase the accuracy of the chosen block size, our execution model takes both intra- and inter-node performance into account. It is important to note that our system finds an effective block size automatically, without the experimentation that is necessary when using a statically chosen block size. Performance on a network of workstations shows that programs that use our runtime analysis outperform those that use static block sizes by as much as 18% when the workload is unbalanced. When the workload is balanced, competitive performance is achieved as long as the initial overhead is sufficiently amortized.
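
To make the pipelining scheme concrete, the following is a minimal, hypothetical C/MPI sketch of a fixed-block pipeline over a row-distributed 2D grid: each node computes BLOCK columns of its rows at a time and then forwards the boundary row of that block to the next node. The grid sizes, the update formula, and the fixed BLOCK constant are illustrative assumptions, not the paper's actual code; the paper's contribution is to replace the static BLOCK with (possibly nonuniform) block sizes chosen at runtime from measurements of the first two iterations.

```c
/*
 * Minimal sketch of a pipelined DOACROSS sweep over a 2D grid.
 * Rows are block-distributed across nodes; each node computes its rows
 * in column blocks of BLOCK columns and, after each block, forwards the
 * block's last row to the next node in the pipeline.
 * BLOCK is a fixed compile-time choice here; the paper's system would
 * instead pick (possibly nonuniform) block sizes at runtime.
 * All names, sizes, and the update formula are illustrative.
 */
#include <mpi.h>
#include <stdlib.h>

#define N     1024   /* global number of columns              */
#define ROWS  256    /* rows owned by this node (assumed)     */
#define BLOCK 64     /* columns computed before sending       */

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* local rows plus one ghost row received from the previous node */
    double (*a)[N] = malloc((ROWS + 1) * sizeof *a);
    for (int i = 0; i <= ROWS; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 1.0;

    for (int jb = 0; jb < N; jb += BLOCK) {
        int jend = jb + BLOCK < N ? jb + BLOCK : N;

        /* wait for this column block's ghost row from the previous
           pipeline stage (the row dependence crosses node boundaries) */
        if (rank > 0)
            MPI_Recv(&a[0][jb], jend - jb, MPI_DOUBLE, rank - 1, jb,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* compute the block: each point depends on the row above it */
        for (int i = 1; i <= ROWS; i++)
            for (int j = jb; j < jend; j++)
                a[i][j] = 0.5 * (a[i][j] + a[i - 1][j]);

        /* forward this block's last row to the next pipeline stage */
        if (rank < size - 1)
            MPI_Send(&a[ROWS][jb], jend - jb, MPI_DOUBLE, rank + 1, jb,
                     MPI_COMM_WORLD);
    }

    free(a);
    MPI_Finalize();
    return 0;
}
```

A small BLOCK fills the pipeline quickly but pays per-message overhead often; a large BLOCK amortizes that overhead but delays downstream nodes. The runtime analysis described above navigates exactly this trade-off automatically.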

About this article

Cite this article

Lowenthal, D.K. Accurately Selecting Block Size at Runtime in Pipelined Parallel Programs. International Journal of Parallel Programming 28, 245–274 (2000). https://doi.org/10.1023/A:1007577115980
