Modern compilers are responsible for translating the idealistic operational semantics of the source program into a form that makes efficient use of a highly complex heterogeneous machine. Since optimization problems are associated with huge and unstructured search spaces, this combinatorial task is poorly achieved in general, resulting in weak scalability and disappointing sustained performance. We address this challenge by working on the program representation itself, using a semi-automatic optimization approach to demonstrate that current compilers often suffer from unnecessary constraints and intricacies that can be avoided in a semantically richer transformation framework. Technically, the purpose of this paper is threefold: (1) to show that syntactic code representations close to the operational semantics lead to rigid phase ordering and cumbersome expression of architecture-aware loop transformations, (2) to illustrate how complex transformation sequences may be needed to achieve significant performance benefits, (3) to facilitate the automatic search for program transformation sequences, improving on classical polyhedral representations to better support operations research strategies in a simpler, structured search space. The proposed framework relies on a unified polyhedral representation of loops and statements, using normalization rules to allow flexible and expressive transformation sequencing. This representation extends the scalability of polyhedral dependence analysis and makes it possible to delay the (automatic) legality checks until the end of a transformation sequence. Our work builds on algorithmic advances in polyhedral code generation and has been implemented in a modern research compiler.
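To make the polyhedral idea concrete, the following is a minimal, self-contained sketch (not the paper's actual framework or API): a statement's iteration domain is modeled as a set of integer points, and an affine schedule, given here as a plain matrix, reorders those points. The example performs a loop interchange of a two-deep nest; all names (`iteration_domain`, `schedule`) are illustrative assumptions.

```python
def iteration_domain(n, m):
    """Iteration domain of:  for i in range(n): for j in range(m): S(i, j)."""
    return [(i, j) for i in range(n) for j in range(m)]

def schedule(point, matrix):
    """Apply an affine (here, purely linear) schedule to one iteration point."""
    return tuple(sum(row[k] * point[k] for k in range(len(point)))
                 for row in matrix)

identity = [[1, 0], [0, 1]]      # original (i, j) execution order
interchange = [[0, 1], [1, 0]]   # swapped loops: execute in (j, i) order

domain = iteration_domain(2, 3)
# Sorting the domain lexicographically by its scheduled coordinates
# yields the execution order the transformed loop nest would follow.
original_order = sorted(domain, key=lambda p: schedule(p, identity))
interchanged = sorted(domain, key=lambda p: schedule(p, interchange))

print(original_order)  # [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
print(interchanged)    # [(0, 0), (1, 0), (0, 1), (1, 1), (0, 2), (1, 2)]
```

In this view, composing transformations amounts to composing schedule matrices, and a legality check (that no dependence is reversed) can be deferred until after the whole composition, which is the flexibility the abstract alludes to.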
Girbal, S., Vasilache, N., Bastoul, C. et al. Semi-Automatic Composition of Loop Transformations for Deep Parallelism and Memory Hierarchies. Int J Parallel Prog 34, 261–317 (2006). https://doi.org/10.1007/s10766-006-0012-3