# Lecture Notes in Computer Science 1808

Edited by G. Goos, J. Hartmanis and J. van Leeuwen

Springer: Berlin, Heidelberg, New York, Barcelona, Hong Kong, London, Milan, Paris, Singapore, Tokyo

# Compiler Optimizations for Scalable Parallel Systems

Languages, Compilation Techniques, and Run Time Systems

#### Series Editors

Gerhard Goos, Karlsruhe University, Germany<br>Juris Hartmanis, Cornell University, NY, USA<br>Jan van Leeuwen, Utrecht University, The Netherlands

#### Volume Editors

Santosh Pande, Georgia Institute of Technology, College of Computing, 801 Atlantic Drive, Atlanta, GA 30332, USA. E-mail: santosh@cc.gatech.edu

Dharma P. Agrawal, University of Cincinnati, Department of ECECS, P.O. Box 210030, Cincinnati, OH 45221-0030, USA. E-mail: dpa@ececs.uc.edu

Cataloging-in-Publication Data applied for

Die Deutsche Bibliothek - CIP-Einheitsaufnahme

Compiler optimizations for scalable parallel systems: languages, compilation techniques, and run time systems / Santosh Pande; Dharma P. Agrawal (ed.). - Berlin; Heidelberg; New York; Barcelona; Hong Kong; London; Milan; Paris; Singapore; Tokyo: Springer, 2001 (Lecture notes in computer science; 1808) ISBN 3-540-41945-4

CR Subject Classification (1998): D.3, D.4, D.1.3, C.2, F.1.2, F.3

ISSN 0302-9743<br>ISBN 3-540-41945-4 Springer-Verlag Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law.
Springer-Verlag Berlin Heidelberg New York, a member of BertelsmannSpringer Science+Business Media GmbH<br>http://www.springer.de

© Springer-Verlag Berlin Heidelberg 2001<br>Printed in Germany

Typesetting: Camera-ready by author, data conversion by Boller Mediendesign<br>Printed on acid-free paper SPIN: 10720238 06/3142 5 4 3 2 1 0

# **Preface**

Santosh Pande<sup>1</sup> and Dharma P. Agrawal<sup>2</sup>

<sup>1</sup> College of Computing, 801 Atlantic Drive, Georgia Institute of Technology, Atlanta, GA 30332<br><sup>2</sup> Department of ECECS, ML 0030, PO Box 210030, University of Cincinnati, Cincinnati, OH 45221-0030

We are very pleased to publish this monograph on Compiler Optimizations for Scalable Distributed Memory Systems. Distributed memory systems offer a challenging model of computing and pose fascinating problems regarding compiler optimizations, ranging from language design to run time systems. The research done in this area is thus foundational to many challenges, from memory hierarchy optimizations to communication optimizations, encountered in both stand-alone and distributed systems. It is with this motivation that we present a compendium of research done in this area in the form of this monograph.

This monograph is divided into five sections: section one deals with languages, section two with analysis, section three with communication optimizations, section four with code generation, and section five with run time systems. In the editorial we present a detailed summary of each of the chapters in these sections.

We would like to express our sincere thanks to the many people who contributed to this monograph. First, we would like to thank all the authors for their excellent contributions, which really make this monograph one of a kind; as readers will see, these contributions make the monograph thorough and insightful (for an advanced reader) as well as highly readable and pedagogic (for students and beginners).
Next, we would like to thank our graduate student Haixiang He for all his help in organizing this monograph and for solving LaTeX problems. Finally, we express our sincere thanks to the LNCS Editorial at Springer-Verlag for putting up with our schedule and for all their help and understanding. Without their invaluable help we would not have been able to put this monograph into its beautiful final shape! We sincerely hope the readers find the monograph truly useful in their work, be it further research or practice.

## Introduction

Santosh Pande<sup>1</sup> and Dharma P. Agrawal<sup>2</sup>

<sup>1</sup> College of Computing, 801 Atlantic Drive, Georgia Institute of Technology, Atlanta, GA 30332<br><sup>2</sup> Department of ECECS, ML 0030, PO Box 210030, University of Cincinnati, Cincinnati, OH 45221-0030

# 1. Compiling for Distributed Memory Multiprocessors

#### 1.1 Motivation

Distributed memory parallel systems offer elegant architectural solutions for highly parallel, data intensive applications, primarily because:

- They are highly scalable. These systems currently come in a variety of architectures, such as 3D torus, mesh and hypercube, that allow extra processors to be added should the computing demands increase. Scalability is an important issue, especially for high performance servers such as parallel video servers and for data mining and imaging applications.
- With an increase in parallelism, there is insignificant degradation in memory performance, since memories are isolated and decoupled from direct accesses from processors. This is especially good for data intensive applications such as parallel databases and data mining that demand considerable memory bandwidth. In contrast, in shared memory systems the memory bandwidth may not keep up with the increase in the number of processors. In fact, the overall system performance may degrade due to increased memory contention. This in turn jeopardizes the scalability of an application beyond a point.
- Spatial parallelism in large applications such as fluid flow, weather modeling and image processing, in which the problem domains are perfectly decomposable, is easy to map onto these systems. The achievable speedups are almost linear, primarily due to fast accesses to the data maintained in local memory.
- Interprocessor communication speeds and bandwidths have dramatically improved due to very fast routing. The performance ratings offered by newer distributed memory systems have improved, although they are not comparable to shared memory systems in terms of Mflops.
- Medium grained parallelism can be effectively mapped onto newer systems like the Meiko CS-2, Cray T3D, IBM SP1/SP2 and EM4 due to a low ratio of communication/computation speeds. The communication bottleneck has decreased compared with earlier systems, and this has opened up parallelization of newer applications.

#### 1.2 Complexity

However, programming distributed memory systems remains very complex. Most of the current solutions mandate that the users of such machines manage the processor allocation, data distribution and inter-processor communication in their parallel programs themselves, and doing so well enough to achieve the desired high performance is difficult. In spite of frantic demands by programmers, current solutions provided by (semi-automatic) parallelizing compilers are rather constrained. As a matter of fact, for many applications the only practical success has been through hand parallelization of codes with communication managed through MPI.

In spite of a tremendous amount of research in this area, the applicability of many of the compiler techniques remains rather limited and the achievable performance enhancement remains less than satisfactory. The main reason for the restrictive solutions offered by parallelizing compilers is the enormous complexity of the problem.
Orchestrating computation and communication by suitable analysis, and optimizing their performance through judicious use of underlying architectural features, demands true sophistication on the part of the compiler. It is not even clear whether these complex problems are solvable within the realm of compiler analysis and sophisticated restructuring transformations. Perhaps they are much deeper in nature and go right to the heart of the design of parallel algorithms for such an underlying model of computation. The primary purpose of this monograph is to provide an insight into current approaches and to point to potentially open problems that could have an impact. The monograph is organized in terms of issues ranging from programming paradigms (languages) to effective run time systems.

#### 1.3 Outline of the Monograph

Language design is largely a matter of legacy, and language design for distributed memory systems is no exception to the rule. In section I of the monograph we examine three important approaches (one imperative, one object-oriented and one functional) in this domain that have made a significant impact.

The first chapter, on HPF 2.0, provides an in-depth view of a data parallel language which evolved from Fortran 90. The authors present HPF 1.0 features such as BLOCK distribution and the FORALL loop, as well as new features in HPF 2.0 such as INDIRECT distribution and the ON directive. They also point to the complementary nature of MPI and HPF and discuss features such as the EXTRINSIC interface mechanism. HPF 2.0 has been a major commercial success, with many vendors such as Portland Group and Applied Parallel Research providing highly optimizing compiler support which generates message passing code. Many research issues, especially those related to supporting irregular computation, could prove valuable to domains such as sparse matrix computation.

The next chapter, on Sisal 90, provides a functional view of implicit parallelism specification and mapping.
A shared memory implementation of Sisal is discussed, which involves optimizations such as update in place and copy elimination. Sisal 90 and a distributed memory implementation which uses message passing are also discussed. Finally, multi-threaded implementations of Sisal are discussed, with a focus on multi-threaded optimizations. The newer optimizations, which perform memory management in hardware through dynamically scheduled multi-threaded code, should prove beneficial for the performance of functional languages (including Sisal), which have an elegant programming model.

The next chapter, on HPC++, provides an object oriented view as well as details on a library and compiler strategy to support the HPC++ level 1 release. The authors discuss interesting features related to multi-threading, barrier synchronization and remote procedure invocation. They also discuss library features that are especially useful for scientific programming. Extending this work to newer portable languages such as Java is currently an active area of research.

We also have a chapter on concurrency models of OO paradigms. The authors specifically address a problem called *inheritance anomaly*, which arises when synchronization constraints are implemented within the methods of a class and an attempt is made to specialize methods through inheritance mechanisms. They propose a solution to this problem by separating the specification of synchronization from the method specification. The synchronization construct is not a part of the method body and is handled separately. It will be interesting to study the compiler optimizations on this model related to strength reduction of barriers, and issues such as data partitioning vs. barrier synchronizations.

In section II of the monograph, we focus on various analysis techniques.
Parallelism detection is very important, and the first chapter presents a very interesting comparative study of different loop parallelization algorithms by Allen and Kennedy, Wolf and Lam, Darte and Vivien, and Feautrier. The authors compare them in terms of their performance (ability to parallelize as well as quality of the schedules generated for code generation) and their complexity. The comparison also focuses on the type of dependence information available. Further extensions could involve run-time parallelization given more precise dependence information.

Array data-flow is of utmost importance in optimizations, both sequential and parallel. The first chapter on array data-flow analysis examines this problem in detail and presents techniques for exact data flow as well as for approximate data flow. The exact solution is shown for static control programs. The authors also show applications to interprocedural cases and to some important parallelization techniques such as privatization. Some interesting extensions could involve run-time data flow analysis.

The next chapter discusses interprocedural analysis based on guarded (predicated) array regions. This is a framework based on path-sensitive predicated data-flow which provides summary information. The authors also show an application of their work to improving array privatization based on symbolic propagation. Extensions of these techniques to newer object oriented languages such as Java (which have a clean class hierarchy and inheritance model) could be interesting, since these programs really need such summary MOD information for performing any optimization.

We finally present a very important analysis/optimization technique for array privatization. Array privatization involves removing memory-related dependences, which have a significant impact on communication optimizations, loop scheduling etc.
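To make the notion of a memory-related dependence concrete, the following Python sketch contrasts a loop whose iterations share one scratch array with a version in which each iteration has its own copy. It is only an illustration written for this overview, not code from the chapter; the function names `row_sums_shared` and `row_sums_privatized` are hypothetical.

```python
def row_sums_shared(matrix):
    """Sum of squares per row, using one scratch array for all iterations.

    Each outer iteration overwrites the values the previous iteration
    left in `tmp`. No value flows between iterations, but the reuse of
    the same storage creates output and anti dependences that force the
    iterations to run one after another.
    """
    n_cols = len(matrix[0])
    tmp = [0] * n_cols                  # storage shared by all iterations
    result = []
    for row in matrix:
        for j in range(n_cols):
            tmp[j] = row[j] * row[j]    # write tmp
        result.append(sum(tmp))         # read tmp within the same iteration
    return result


def row_sums_privatized(matrix):
    """Same computation after privatizing `tmp` for each iteration."""
    def body(row):
        tmp = [x * x for x in row]      # private copy, local to this iteration
        return sum(tmp)

    # The iterations no longer touch common storage, so they could be
    # executed in any order or distributed across processors.
    return [body(row) for row in matrix]
```

Both functions return the same result; the difference is only that the second one no longer carries storage-induced dependences between iterations, which is the property a privatizing compiler tries to establish automatically.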
The authors present a demand-driven data-flow formulation of the problem; an algorithm which performs single pass propagation of symbolic array expressions is also presented. This comprehensive framework, implemented in the Polaris compiler, is making a significant impact in improving many other related optimizations such as load balancing and communication.

The next section is focused on communication optimization. Communication optimization can be achieved through data (and iteration space) distribution, statically or dynamically. These approaches are further classified into data and code alignment, or simply iteration space transformations such as tiling. Communication can also be optimized in data-parallel programs through array region analysis. Finally, one could tolerate some communication latency through novel techniques such as multi-threading. We have chapters which cover this broad range of topics about communication in depth.

The first chapter in this section focuses on tiling for cache-coherent multicomputers. This work derives optimal tile parameters for minimal communication in loops with affine index expressions. The authors introduce a notion of data footprints and tile the iteration spaces so that the volume of communication is minimized. They develop an important lattice theoretic framework to precisely determine the sizes of data footprints, which are very valuable not only in tiling but in many array distribution transformations.

The next two chapters deal with the important problem of communication-free loop partitioning. The second chapter in this section focuses on comparing different methods of achieving communication-free partitioning for DOALL loops. This chapter discusses several variants of the communication-free partitioning problem involving duplication or non-duplication of data, load balancing of the iteration space and aspects such as statement level vs. loop level partitioning.
Several aspects such as trading parallelism to avoid inter-loop data distribution are also touched upon. Extending these techniques to broader classes of DOALL loops could enhance their applicability.

The next chapter, by Pingali et al., proposes a very interesting framework which first determines a set of constraints on data and loop iteration placement. The authors then determine which constraints should be left unsatisfied in order to relax an overconstrained system and find a solution involving a large amount of parallelism. Finally, the remaining constraints are solved for data and code distribution. The systematic linear algebraic framework improves over many ad-hoc loop partitioning approaches.

These approaches trade parallelism for codes that allow decoupling the issues of parallelism and communication by relaxing an appropriate constraint of the problem. However, for many important problems, such as image processing applications, such a relaxation is not possible. That is, one must resort to a different partitioning solution based on the relative costs of communication and computation. In the next chapter, a new approach is proposed for solving such a problem: the iteration space is partitioned by determining directions which maximally cover the communication while minimally trading parallelism. This approach allows mapping of general medium grained DOALL loops. However, the communication resulting from this iteration space partitioning cannot be easily aggregated without sophisticated pack/unpack mechanisms present at the send/receive ends. Such extensions are desirable, since aggregating communication has as significant an impact as reducing the volume.

Static data distribution and alignment typically solve the problems of communication on a loop nest by loop nest basis, but rarely in an intraprocedural scope. Most of the inter-loop nest level and interprocedural boundaries require dynamic data redistribution. Banerjee et al.
develop techniques that can be used to automatically determine which data partitions are most beneficial over specific sections of the program by accounting for redistribution overhead. They determine split points and phases of communication, and redistribution is performed at the split points.

When communication must take place, it should be optimized. Also, any redundancies must be captured and eliminated. In the next chapter, Manish Gupta proposes a comprehensive approach for performing global (interprocedural) communication optimizations such as vectorization, PRE, coalescing and hoisting. Such an interprocedural approach to communication optimization is highly profitable in substantially improving performance. Extending this work to irregular communication could be interesting.

Finally, we present a multi-threaded approach which could hide the communication latency. Two representative applications involving bitonic sort and FFT are chosen, and using fine grained multi-threading on EM-X it is shown that multi-threading can substantially help in overlapping computation with communication to hide latencies up to 35%. These methods could be especially useful for irregular computation.

The final phase of compiling for distributed memory systems involves solving many code generation problems. Code generation involves determining which communication to generate and calculating addresses to map global references to local ones. The next section deals with these issues. The first chapter presents structures and techniques for communication generation. The authors focus on issues such as flexible computation partitioning (going beyond the owner computes rule), communication adaptation based upon manipulating integer sets through abstract inequalities, and control flow simplification based on these.
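For reference, the owner computes rule that this work generalizes can be sketched in a few lines of Python. The sketch below is an assumption made for illustration and is not the chapter's method: it takes a one-dimensional BLOCK distribution of arrays `A` and `B` over `p` processors and the statement `A[i] = B[i-1] + B[i+1]`, assigns each iteration to the processor owning `A[i]`, and collects the remote elements of `B` that each processor would have to receive. The helper names `block_owner` and `plan_stencil` are hypothetical.

```python
def block_owner(i, n, p):
    """Owner of global index i when n elements are split into p contiguous blocks."""
    block = -(-n // p)          # ceil(n / p), the block size
    return i // block


def plan_stencil(n, p):
    """Plan A[i] = B[i-1] + B[i+1] for 1 <= i <= n-2 under the owner computes rule.

    Returns two dicts keyed by processor number:
      iterations[q] - the iterations q executes (it owns A[i] for each of them)
      receives[q]   - global indices of B that q reads but does not own
    """
    iterations = {q: [] for q in range(p)}
    receives = {q: set() for q in range(p)}
    for i in range(1, n - 1):
        q = block_owner(i, n, p)            # the owner of A[i] runs iteration i
        iterations[q].append(i)
        for j in (i - 1, i + 1):            # right-hand-side references
            if block_owner(j, n, p) != q:   # element lives on another processor
                receives[q].add(j)
    return iterations, receives
```

With `n = 8` and `p = 2`, processor 0 executes iterations 1 to 3 and needs `B[4]`, while processor 1 executes iterations 4 to 6 and needs `B[3]`. A real code generator would derive these sets symbolically rather than by enumeration, which is where the integer set manipulation mentioned above comes in.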
One good property of this work is that it can work with many different front ends (not just data parallel languages), and the code generator has more opportunities to perform low level optimizations due to simplified control flow.

The second chapter discusses basis vector based address calculation mechanisms for efficient traversals of partitioned data. While one important issue of code generation is communication generation, another very important issue is to map the global address space to the local address space efficiently. The problem is complicated by data distributions and access strides. Ramanujam et al. present closed form expressions for basis vectors for several cases. Using the closed form expressions for the basis vectors, they derive a non-unimodular linear transformation.

The final section is on supporting task parallelism and dynamic data structures. We also present a run-time system to manage irregular computation. The first chapter, by Darbha et al., presents a task scheduling approach that is optimal for many practical cases. The authors evaluate its performance for many practical applications such as the Bellman-Ford algorithm, Cholesky decomposition and the Systolic algorithm. They show that schedules generated by their algorithm are optimal for some cases and near optimal for most others. With HPF 2.0 supporting task parallelism, this could open up many new application domains.

The next two chapters describe language support for dynamic data structures such as pointers in a distributed address space. Gupta describes several extensions to C with declarations such as TREE, ARRAY and MESH to declare dynamic data structures. He then describes name generation and distribution strategies. Finally, he describes support for both regular and irregular dynamic structures. The second chapter, by Rogers et al., presents the approach followed in their Olden project, which uses a distributed heap.
Remote access is handled by software caching or computation migration. The selection between these mechanisms is done automatically through a compile time heuristic. The authors provide a data layout annotation called local path lengths, which allows programmers to give hints regarding the expected data layout, thereby fixing these mechanisms. Both of these chapters provide highly useful insights into supporting dynamic data structures, which are very important for the scalable domains of computation supported by these machines. Thus, these works should have a significant impact on future scalable applications supported by these systems.

Finally, we present a run-time system called CHAOS which provides efficient support for irregular computations. Due to indirection in many sparse matrix computations, the communication patterns are unknown at compile time in these applications. Indirection patterns have to be preprocessed, and the sets of elements to be sent and received by each processor precomputed, in order to optimize communication. In this work, the authors provide details of efficient run time support for an *inspector-executor* model.

#### 1.4 Future Directions

The two important bottlenecks for the use of distributed memory systems are the limited application domains and the fact that the performance is less than satisfactory. The main bottleneck seems to be handling communication. Thus, efficient solutions must be developed. Application domains beyond regular communication can be handled by supporting a general run-time communication model. This run-time communication model must be latency hiding and should give sufficient flexibility to the compiler to defer the hard decisions to run time, yet allow static optimizations involving communication motion etc. One of the big problems compilers face is that estimating the cost of communication is almost impossible. They can however gauge criticality (or relative importance) of communication.
Developing such a model will allow compilers to more ellectively deal with issues of relative importance betwen computation and communication and communication. Probably the best reason to use distributed memory systems is to bene throw scalability even though application domains and performance might be somewhat weaker. Thus, new research must be done in scalable code generation. In other words, as size of the problem and number of processors increase, should there be a change in data/code partition or should it remain the same? What code generation issues are related to this? How could one potentially handle the hot spots that inevitably (although at much lower levels than shared memory systems) arise? Can one benefit from the above communication model and dynamic data ownerships discussed earlier? # Table of Contents | | eface<br>ntosh Pande and Dharma P. Agrawal | V | |---------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------| | | troduction<br>ntosh Pande and Dharma P. Agrawal bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb | ΧI | | 1 | Compiling for Distributed Memory Multiprocessors XX 1.1 Motivation XX 1.2 Complexity XX 1.3 Outline of the Monograph XX 1.4 Future Directions XXV | IX<br>III<br>III | | $\mathbf{Se}$ | ection I : Languages | | | | napter 1. High Performance Fortran 2.0 n Kennedy and Charles Koelbel DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD | 3 | | 1<br>2<br>3 | Introduction | 3<br>7<br>7<br>13 | | 4 | 4.1 Basic Language Features | 18<br>19<br>29 | | 5 | Task Parallelism | 34<br>34<br>37 | | 6<br>7 | Input and Output | 39<br>41 | # Chapter 2. 
The Sisal Project: Real World Functional Programming | Jea | an-Luc Gaudiot, Tom DeBoni, John Feo, Wim Bahm, | | |-----|----------------------------------------------------------------|----| | Wa | alid Najjar, and Patrick Miller DDDDDDDDDDDDDDDDDDDDDDDDDDDDDD | 45 | | 1 | Introduction | 45 | | 2 | The Sisal Language: A Short Tutorial | 46 | | 3 | An Early Implementation: The Optimizing Sisal Compiler | 49 | | | 3.1 Update in Place and Copy Elimination | 49 | | | 3.2 Build in Place | 50 | | | 3.3 Reference Counting Optimization | 51 | | | 3.4 Vectorization | 51 | | | 3.5 Loop Fusion, Double Bullering Pointer Swap, and Inversion | 51 | | 4 | Sisal90 | 53 | | | 4.1 The Foreign Language Interface | 54 | | 5 | A Prototype Distributed-Memory SISAL Compiler | 58 | | | 5.1 Base Compiler | 59 | | | 5.2 Rectangular Arrays | 59 | | | 5.3 Block Messages | 60 | | | 5.4 Multiple Alignment | 60 | | | 5.5 Results | 61 | | | 5.6 Further Work | 62 | | 6 | Architecture Support for Multithreaded Execution | 62 | | | 6.1 Blocking and Non-blocking Models | 63 | | | 6.2 Code Generation | 64 | | | 6.3 Summary of Performance Results | 68 | | 7 | Conclusions and Future Research | 69 | | ~. | | | | | hapter 3. 
HPC++ and the HPC++Lib Toolkit | | | | ennis Gannon, Peter Beckman, Elizabeth Johnson, Todd Green, | | | an | d Mike Levine обобобобобобобобобобобобобобобобобобоб | 73 | | 1 | Introduction | 73 | | 2 | The HPC++ Programming and Execution Model | 74 | | | 2.1 Level 1 HPC++ | 75 | | | 2.2 The Parallel Standard Template Library | 76 | | | 2.3 Parallel Iterators | 77 | | | 2.4 Parallel Algorithms | 77 | | | 2.5 Distributed Containers | 78 | | 3 | A Simple Example: The Spanning Tree of a Graph | 78 | | 4 | Multi-threaded Programming | 82 | | | 4.1 Synchronization | 84 | | | 4.2 Examples of Multi-threaded Computations | 92 | | 5 | Implementing the HPC++ Parallel Loop Directives | 96 | | | _ * | | | | Table of Contents | IX | |---------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------| | 6<br>7<br>8 | Multi-context Programming and Global Pointers6.1 Remote Function and Member Calls16.2 Using Corba IDL to Generate Proxies1The SPMD Execution Model17.1 Barrier Synchronization and Collective Operations1Conclusion1 | .03<br>.05<br>.05 | | In | napter 4. A Concurrency Abstraction Model for Avoiding heritance Anomaly in Object-Oriented Programs andeep Kumar and Dharma P. Agrawal | 09 | | 1 2 | Introduction | .09<br>.13<br>.13 | | 3 | What Is the Inheritance Anomaly? | .15<br>.16<br>.18<br>.18 | | 4<br>5<br>6 | What Is the Reusability of Sequential Classes? 
| .20<br>.21 | | 7<br>8 | The Concurrency Abstraction Model1The CORE Language18.1 Specifying a Concurrent Region18.2 Dellning an AC18.3 Dellning a Parallel Block18.4 Synchronization Schemes1 | .23<br>.26<br>.26<br>.26<br>.27 | | 9<br>10<br>11 | Illustrations19.1 Reusability of Sequential Classes19.2 Avoiding the Inheritance Anomaly1The Implementation Approach1Conclusions and Future Directions1 | .30<br>.31<br>.33 | | Se | ection II : Analysis | | | | napter 5. Loop Parallelization Algorithms ain Darte, Yves Robert, and Frederic Vivien DODDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD | .41 | | 1 2 | Introduction1Input and Output of Parallelization Algorithms12.1 Input: Dependence Graph12.2 Output: Nested Loops1 | 42 | | v | Table | ~ C | Contents | |---|-------|-----|----------| | X | rabie | OI | Contents | | 3 | Dependence Abstractions | 5 | |---------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  - 3.1 Dependence Graphs and Distance Sets
  - 3.2 Polyhedral Reduced Dependence Graphs
  - 3.3 Definition and Simulation of Classical Dependence [...]
- 4 [...] Algorithm
  - [...]ns
- 5 [...]m
  - [...]tation
  - [...]hm
  - [...]ns
- 6 [...] Algorithm
  - [...] Is Needed
  - [...]nces: A Motivating Example
  - [...]ons
  - [...]ns
- 7 Feautrier's Algorithm
- 8 Conclusion

Chapter 6. Array Dataflow Analysis

Paul Feautrier

- 1 Introduction
- 2 Exact Array Dataflow Analysis
  - 2.1 Notations
  - 2.2 The Program Model
  - 2.3 Data Flow Analysis
  - 2.4 Summary of the Alg[...]
  - 2.5 Related Work
- 3 Approximate Array Data[...]
  - 3.1 From ADA to FADA
  - 3.2 Introducing Parame[...]
  - 3.3 Taking Properties of [...]
- [...]
- 5 Applications of ADA and FADA
- [...]

Chapter 7.
Interprocedural Analysis Based on Guarded Array Regions

Zhiyuan Li, Junjie Gu, and Gyungho Lee

- 1 Introduction
- 2 Preliminary
  - 2.1 Traditional Flow-Insensitive Summaries
  - 2.2 Array Data Flow Summaries
- 3 Guarded Array Regions
- 4 Constructing Summary GARL Interprocedurally
  - 4.1 Hierarchical Supergraph
  - 4.2 Summary Algorithms
- 5 Implementation Considerations
  - 5.1 Symbolic Analysis
  - 5.2 Region Numbering
  - 5.3 Range Operations
- 6 [...]
- [...] Results
- 7 [...]
- 8 [...]

Chapter 8. Automatic Array Privatization

Peng Tu and David Padua

- 1 Introduction
- 2 Background
- 3 Algorithm for Array Privatization
  - 3.1 Data Flow Framework
  - 3.2 Inner Loop Abstraction
  - 3.3 An Example
  - 3.4 Profitability of Privatization
  - 3.5 Last Value Assignment
- 4 Demand-Driven Symbolic Analysis
  - 4.1 Gated Single Assignment
  - 4.2 Demand-Driven Backward Substitution
  - 4.3 Backward Substitution in the Presence of Gating Functions
  - 4.4 Examples of Backward Substitution
  - 4.5 Bounds of Symbolic Expression
  - 4.6 Comparison of Symbolic Expressions
  - 4.7 Recurrence and the $\theta$ Function
  - 4.8 Bounds of Monotonic Variables
  - 4.9 Index Array
  - 4.10 Conditional Data Flow Analysis
  - 4.11 Implementation and Experiments
- 5 Related Work

Section III: Communication Optimizations

Chapter 9.
Optimal Tiling for Minimizing Communication in Distributed Shared-Memory Multiprocessors

Anant Agarwal, David Kranz, Rajeev Barua, and Venkat Natarajan

- 1 Introduction
  - 1.1 Contributions and Related Work
  - 1.2 Overview of the Paper
- 2 Problem Domain and Assumptions
  - 2.1 Program Assumptions
  - 2.2 System Model
- 3 Loop Partitions and Data Partitions
- 4 A Framework for Loop and Data Partitioning
  - 4.1 Loop Tiles in the Iteration Space
  - 4.2 Footprints in the Data Space
  - 4.3 Size of a Footprint for a Single Reference
  - 4.4 Size of the Cumulative Footprint
  - 4.5 Minimizing the Size of the Cumulative Footprint
- 5 General Case of **G**
  - 5.1 **G** Is Invertible, but Not Unimodular
  - 5.2 Columns of **G** Are Dependent and the Rows Are Independent
  - 5.3 The Rows of **G** Are Dependent
- 6 Other System Environments
  - 6.1 Coherence-Related Cache Misses
  - 6.2 Effect of Cache Line Size
  - 6.3 Data Partitioning in Distributed-Memory Multicomputers
- 7 Combined Loop and Data Partitioning in DSMs
- 8 Implementation and Results
- 9 Conclusions
- A A Formulation of Loop Tiles Using Bounding Hyperplanes
- B [...]

Chapter 10.
Communication-Free Partitioning of Nested Loops

Kuei-Ping Shih, Chua-Huang Huang, and Jang-Ping Sheu

- 1 Introduction
- 2 Fundamentals of Array References
  - 2.1 Iteration Spaces and Data Spaces
  - 2.2 Reference Functions
  - 2.3 Properties of Reference Functions
- 3 Loop-Level Partitioning
  - 3.1 Iteration and Data Spaces Partitioning - Uniformly Generated References
  - 3.2 Hyperplane Partitioning of Data Space
  - 3.3 Hyperplane Partitioning of Iteration and Data Spaces
- 4 [...]
  - 4.1 Affine Processor Mapping
  - 4.2 Hyperplane Partitioning
- 5 Comparisons and Discussions
- 6 Conclusions

Chapter 11. Solving Alignment Using Elementary Linear Algebra

Vladimir Kotlyar, David Bau, Induprakas Kodukula, Keshav Pingali, and Paul Stodghill

- 1 Introduction
- 2 [...]
  - 2.1 Equational Constraints
  - 2.2 Reduction to Null Space Computation
  - 2.4 Reducing the Solution Basis
- 3 Affine Alignment
  - 3.1 Encoding Affine Constraints as Linear Constraints
- 4 Replication
  - 4.1 Formulation of Replication
- 5 Heuristics
  - 5.2 Implications for Alignment Heuristic
- 6 Conclusion
- A [...]
  - A.1 Unrelated Constraints
  - A.2 General Procedure
- B A Comment on Affine Encoding

Chapter 12.
A Compilation Method for Communication-Efficient Partitioning of DOALL Loops

Santosh Pande and Tareq Bali

- 1 Introduction
- 2 DOALL Partitioning
  - 2.1 Motivating Example
  - 2.2 Our Approach
- 3 Terms and Definitions
  - 3.1 Example
- 4 Problem
  - 4.1 Compatibility Subsets
  - 4.2 Cyclic Directions
- 5 Communication Minimization
  - 5.1 Algorithm: Maximal Compatibility Subsets
  - 5.2 Algorithm: Maximal Fibonacci Sequence
  - 5.3 Data Partitioning
- 6 Partition Merging
  - 6.1 Granularity Adjustment
  - 6.2 Load Balancing
  - 6.3 Mapping
- 7 Example: Texture Smoothing Code
- 8 Performance on Cray T3D
  - 8.1 Conclusions

Chapter 13. Compiler Optimization of Dynamic Data Distributions for Distributed-Memory Multicomputers

Daniel J. Palermo, Eugene W. Hodges IV, and Prithviraj Banerjee

- 1 Introduction
- 2 Related Work
- 3 Dynamic Distribution Selection
  - 3.1 Motivation for Dynamic Distributions
  - 3.2 Overview of the Dynamic Distribution Approach
  - 3.3 Phase Decomposition
  - 3.4 Phase and Phase Transition Selection
- 4 Data Redistribution Analysis
  - 4.1 Reaching Distributions and the Distribution Flow Graph
  - 4.2 Computing Reaching Distributions
  - 4.3 Representing Distribution Sets
- 5 Interprocedural Redistribution Analysis
  - 5.1 Distribution Synthesis
  - 5.2 Redistribution Synthesis
  - 5.3 Static Distribution Assignment (SDA)
- 6 Results
  - 6.1 Synthetic HPF Redistribution Example
  - 6.2 2-D Alternating Direction Implicit (ADI2D) Iterative Method
  - 6.3 Shallow Water Weather Prediction Benchmark
- 7 Conclusions

Chapter 14. A Framework for Global Communication Analysis and Optimizations

Manish Gupta

- 1 Introduction
- 2 Motivating Example
- 3 Available Section Descriptor
  - 3.1 Representation of ASD
  - 3.2 Computing Generated Communication
- 4 Data Flow Analysis
  - 4.1 Data Flow Variables and Equations
  - 4.2 Decomposition of Bidirectional Problem
  - 4.3 Overall Data-Flow Procedure
- 5 Communication Optimizations
  - 5.1 Elimination of Redundant Communication
  - 5.2 Reduction in Volume of Communication
  - 5.3 Movement of Communication for Subsumption and for Hiding Latency
- 6 Extensions: Communication Placement
- 7 Operations on Available Section Descriptors
  - 7.1 Operations on Bounded Regular Section Descriptors
  - 7.2 Operations on Mapping Function Descriptors
- 8 Preliminary Implementation and Results
- 9 Related Work
  - 9.1 Global Communication Optimizations
  - 9.2 Data Flow Analysis and Data Descriptors
- 10 Conclusions

Chapter 15. Tolerating Communication Latency through Dynamic Thread Invocation in a Multithreaded Architecture

Andrew Sohn, Yuetsu Kodama, Jui-Yuan Ku, Mitsuhisa Sato, and Yoshinori Yamaguchi

- 1 Introduction
- 2 Multithreading Principles and Its Realization
  - 2.1 The Principle
  - 2.2 The EM-X Multithreaded Distributed-Memory Multiprocessor
  - 2.3 Architectural Support for Fine-Grain Multithreading
- 3 Designing Multithreaded Algorithms
  - 3.1 Multithreaded Bitonic Sorting
  - 3.2 Multithreaded Fast Fourier Transform
- 4 Overlapping Analysis
- 5 [...]
- 6 [...]

Section IV: Code Generation

Chapter 16. Advanced Code Generation for High Performance Fortran

Vikram Adve and John Mellor-Crummey

- 1 Introduction
- 2 [...]
  - [...] Generation
- 3 [...]
- 4 Computation Partitioning
- 5 Communication Code Generation
  - 5.2 Recognizing In-Place Communication
- 6 Control Flow Simplification
  - 6.2 Overview of Algorithm
- 7 [...]

Chapter 17. [...] Address Generation for Block-Cyclic Distributions

J. Ramanujam

- 1 Introduction
- 2 Background and Related Work
  - 2.1 Related Work on One-Level Mapping
  - 2.2 Related Work on Two-Level Mapping
- 3 A Lattice Based Approach for Address Generation
  - 3.1 Assumptions
  - 3.2 Lattices
- 4 Determination of Basis Vectors
  - 4.1 Basis Determination Algorithm
  - 4.2 Extremal Basis Vectors
  - 4.3 Improvements to the Algorithm for $s < k$
  - 4.4 Complexity
- 5 Address Sequence Generation by Lattice Enumeration
- 6 Optimization of Loop Enumeration: GO-LEFT and GO-RIGHT
  - 6.1 Implementation
- 7 Experimental Results for One-Level Mapping
- 8 Address Sequence Generation for Two-Level Mapping
  - 8.1 Problem Statement
- 9 Algorithms for Two-Level Mapping
  - 9.1 *Itable*: An Algorithm That Constructs a Table of Offsets
  - 9.2 Optimization of the *Itable* Method
  - 9.3 Search-Based Algorithms
- 10 Experimental Results for Two-Level Mapping
- 11 Other Problems in Code Generation
  - 11.1 Communication Generation
  - 11.2 Union and Difference of Regular Sections
  - 11.3 Code Generation for Complex Subscripts
  - 11.4 Data Structures for Runtime Efficiency
  - 11.5 Array Redistribution
- 12 Summary and Conclusions

Section V: Task Parallelism, Dynamic Data Structures and Run Time Systems

Chapter 18. A Duplication Based Compile Time Scheduling Method for Task Parallelism

Sekhar Darbha and Dharma P. Agrawal

- 1 Introduction
- 2 STDS Algorithm
  - 2.1 Complexity Analysis
- 3 Illustration of the STDS Algorithm
- 4 Performance of the STDS Algorithm
  - 4.1 CRC Is Satisfied
  - 4.2 Application of Algorithm for Random Data
  - 4.3 Application of Algorithm to Practical DAGs
  - 4.4 Scheduling of Diamond DAGs
  - 4.5 Comparison with Other Algorithms
- 5 Conclusions

Chapter 19.
SPMD Execution in the Presence of Dynamic Data Structures

Rajiv Gupta

- 1 Introduction
- 2 Language Support for Regular Data Structures
  - 2.1 Processor Structures
  - 2.2 Dynamic Data Structures
  - 2.3 Name Generation and Distribution Strategies
  - 2.4 Examples
- 3 Compiler Support for Regular Data Structures
  - 3.2 Translation of Pointer Operations
- 4 Supporting Irregular Data Structures
- 5 Compile-Time Optimizations
- 6 Related Work

Chapter 20. Supporting Dynamic Data Structures with Olden

Martin C. Carlisle and Anne Rogers

- 1 Introduction
- 2 Programming Model
  - 2.1 Programming Language
  - 2.2 Data Layout
  - 2.3 Marking Available Parallelism
- 3 Execution Model
- 4 Selecting Between Mechanisms
- 5 Experimental Results
  - 5.1 Comparison with Other Published Work
  - 5.2 Heuristic Results
  - 5.3 Summary
- 6 Profiling in Olden
  - 6.1 Verifying Local Path Lengths
- 7 Related Work
  - 7.1 Gupta's Work
  - 7.2 Object-Oriented Systems
  - 7.3 Extensions of C with Fork-Join Parallelism
  - 7.4 Other Related Work
- 8 Conclusions

Chapter 21. Runtime and Compiler Support for Irregular Computations

Raja Das, Yuan-Shin Hwang, Joel Saltz, and Alan Sussman

- 1 Introduction
- 2 Overview of the CHAOS Runtime System
- 3 Compiler Transformations
  - 3.1 Transformation Example
  - 3.2 Definitions
  - 3.3 Transformation Algorithm
- 4 Experiments
  - 4.1 Hand Parallelization with CHAOS
  - 4.2 Compiler Parallelization Using CHAOS
- 5 Conclusions

Author Index