ABSTRACT
Existing work on parallelizing complicated reductions and scans has focused mainly on formalism and has rarely dealt with implementation. To bridge this gap between formalism and implementation, we have integrated parallelization via matrix multiplication into compiler construction. Our framework can handle complicated loops that existing compiler techniques cannot parallelize. Moreover, we have refined the framework with two sets of techniques: one broadens its applicability by extracting max-operators automatically, and the other improves the performance of parallelized programs by eliminating redundancy. We have implemented our framework and these techniques as a parallelizer in a compiler. Experiments on examples that existing compilers cannot parallelize demonstrate the scalability of programs parallelized by our implementation.
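The sketch below illustrates, under simplifying assumptions, the basic idea behind parallelization via matrix multiplication; it is not the paper's framework, and the names (Aff, compose, product) are illustrative only. A first-order linear recurrence x = a[i]*x + b[i] corresponds to multiplying the vector (x, 1) by the matrix [[a[i], b[i]], [0, 1]]; because matrix multiplication is associative, the product of all these matrices can be computed by a tree-shaped reduction whose halves are independent and could run in parallel.

```c
/* Minimal sketch (not the paper's implementation) of parallelization via
 * matrix multiplication for the sequential loop
 *
 *     for (i = 0; i < n; i++) x = a[i] * x + b[i];
 */
#include <stdio.h>

typedef struct { double a, b; } Aff;   /* compact form of [[a, b], [0, 1]] */

/* Compose two affine maps, f applied first: (g . f)(x) = g.a*(f.a*x + f.b) + g.b */
static Aff compose(Aff f, Aff g) {
    Aff h = { g.a * f.a, g.a * f.b + g.b };
    return h;
}

/* Product of the matrices for indices [lo, hi).  The two recursive calls are
 * independent, so they could be dispatched to separate threads or tasks. */
static Aff product(const double *a, const double *b, int lo, int hi) {
    if (hi - lo == 1) { Aff m = { a[lo], b[lo] }; return m; }
    int mid = lo + (hi - lo) / 2;
    Aff left  = product(a, b, lo, mid);
    Aff right = product(a, b, mid, hi);
    return compose(left, right);      /* earlier indices applied first */
}

int main(void) {
    double a[] = { 2.0, 0.5, 3.0, 1.0 };
    double b[] = { 1.0, 4.0, -2.0, 0.5 };
    int n = 4;
    double x0 = 1.0;

    /* Sequential reference. */
    double x = x0;
    for (int i = 0; i < n; i++) x = a[i] * x + b[i];

    /* Same result via the (parallelizable) matrix product. */
    Aff m = product(a, b, 0, n);
    double y = m.a * x0 + m.b;

    printf("sequential = %g, via matrices = %g\n", x, y);
    return 0;
}
```

The same associativity argument carries over to other semirings; for instance, recurrences involving max (as in the max-operator extraction mentioned above) can be treated as matrix products over the max-plus semiring.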