Abstract
Loop fusion is a common optimization technique that combines several loops into a single larger loop. Most existing work on loop fusion concentrates on the heuristics required to optimize an objective function, such as data reuse or the creation of instruction-level parallelism opportunities. Often, however, the code provided to a compiler contains only small sets of loops that are control-flow equivalent, normalized, have the same iteration count, are adjacent, and have no fusion-preventing dependences. This paper focuses on code transformations that create more opportunities for loop fusion in the IBM® XL compiler suite, which generates code for the IBM family of PowerPC® processors. In this compiler, an objective function is used at the loop distributor to decide which portions of a loop should remain in the same loop nest and which portions should be redistributed. Our algorithm focuses on eliminating conditions that prevent loop fusion. By generating maximal fusion, our algorithm increases the scope of later transformations. We tested our improved code generator on an IBM pSeries™ 690 machine equipped with a POWER4™ processor using the SPEC CPU2000 benchmark suite. Our improvements to loop fusion resulted in three times as many loops fused in a subset of the CFP2000 benchmarks, and four times as many in a subset of the CINT2000 benchmarks.
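As a rough illustration of the conditions listed in the abstract, the C sketch below shows two adjacent loops that can be fused directly, and a pair whose iteration counts differ by one, where peeling a single iteration makes the remaining loops conform. The function names and the choice of peeling are assumptions made for this example only; they are not taken from the paper and do not represent the XL compiler's actual algorithm. The arrays are assumed not to overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Two adjacent loops with the same iteration count. The second loop reads
   a[i], which the first loop wrote in the same iteration i, so the
   dependence does not prevent fusion. (Reading a[i + 1] instead would.) */
void unfused(double *a, double *c, const double *b, size_t n)
{
    assert(a && b && c);
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] * 2.0;
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + 1.0;
}

/* The same computation after fusion: one pass over the data. */
void fused(double *a, double *c, const double *b, size_t n)
{
    assert(a && b && c);
    for (size_t i = 0; i < n; i++) {
        a[i] = b[i] * 2.0;
        c[i] = a[i] + 1.0;
    }
}

/* An impediment: the loops run n and n - 1 iterations respectively. */
void mismatched(double *a, double *c, const double *b, size_t n)
{
    assert(a && b && c);
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] * 2.0;
    for (size_t i = 0; i + 1 < n; i++)
        c[i] = a[i] + 1.0;
}

/* Hypothetical remedy: peel the last iteration of the first loop so both
   loops have n - 1 iterations, then fuse them. */
void peeled_and_fused(double *a, double *c, const double *b, size_t n)
{
    assert(a && b && c);
    if (n == 0)
        return;
    for (size_t i = 0; i + 1 < n; i++) {
        a[i] = b[i] * 2.0;
        c[i] = a[i] + 1.0;
    }
    a[n - 1] = b[n - 1] * 2.0;   /* peeled iteration */
}

/* Helper: returns 1 if the two arrays hold identical values. */
int same(const double *x, const double *y, size_t n)
{
    return memcmp(x, y, n * sizeof *x) == 0;
}
```

In both pairs the transformed version computes the same values as the original while traversing `a` only once, which is the kind of data reuse the abstract mentions as a typical objective.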
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Blainey, B., Barton, C., Amaral, J.N. (2005). Removing Impediments to Loop Fusion Through Code Transformations. In: Pugh, B., Tseng, CW. (eds) Languages and Compilers for Parallel Computing. LCPC 2002. Lecture Notes in Computer Science, vol 2481. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11596110_21
DOI: https://doi.org/10.1007/11596110_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-30781-5
Online ISBN: 978-3-540-31612-1
eBook Packages: Computer Science, Computer Science (R0)