DOI: 10.1145/1023833.1023860
Article

General loop fusion technique for nested loops considering timing and code size

Published: 22 September 2004

Abstract

Loop fusion is commonly used to improve the instruction-level parallelism of loops in high-performance embedded computing systems. It is not always directly applicable, however, because fusion-preventing dependencies may exist among the loops. Most existing techniques remain limited in how fully they exploit the advantages of loop fusion. In this paper, we present a general loop fusion technique for loops and nested loops based on the loop dependency graph model, retiming, and multi-dimensional retiming. We show that any "J+K" model loop can be legally fused using our legalizing fusion technique. We develop polynomial-time algorithms that solve the loop fusion problem for "J+K" model loops while considering both the timing and the code size of the final code. Our technique produces the final code and calculates the resulting code size directly from the retiming values. The experimental results show that our loop fusion technique always significantly reduces the schedule length.
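
To make the notion of a fusion-preventing dependency concrete, the following is a minimal one-dimensional sketch, not the paper's algorithm. It assumes a hypothetical pair of loops in which the second loop reads a value that the first loop only produces one iteration later, and it legalizes the fusion by delaying the second loop's statement by one iteration, which is the basic idea behind retiming-style shifting. All function and variable names are illustrative assumptions.

```python
def separate_loops(x):
    """Reference behaviour: two loops executed one after the other."""
    n = len(x)
    a = [0] * n
    b = [0] * max(n - 1, 0)
    for i in range(n):            # loop 1 produces a[i]
        a[i] = 2 * x[i]
    for i in range(n - 1):        # loop 2 consumes a[i] and a[i + 1]
        b[i] = a[i] + a[i + 1]
    return a, b


def naive_fusion(x):
    """Illegal direct fusion: b[i] reads a[i + 1] before loop 1 has written it."""
    n = len(x)
    a = [0] * n
    b = [0] * max(n - 1, 0)
    for i in range(n):
        a[i] = 2 * x[i]
        if i < n - 1:
            b[i] = a[i] + a[i + 1]   # a[i + 1] still holds its initial 0
    return a, b


def shifted_fusion(x):
    """Legal fusion: loop 2's statement is delayed by one iteration."""
    n = len(x)
    a = [0] * n
    b = [0] * max(n - 1, 0)
    if n == 0:
        return a, b
    a[0] = 2 * x[0]               # prologue: first iteration of loop 1 alone
    for i in range(1, n):         # fused body
        a[i] = 2 * x[i]
        b[i - 1] = a[i - 1] + a[i]   # both operands are already computed
    return a, b
```

In this sketch the shifted version matches the separate loops, while the naive version does not. The prologue statement is extra code that the shift introduces, which gives a small illustration of why code size is worth tracking alongside timing when loops are fused this way.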



Published In

CASES '04: Proceedings of the 2004 international conference on Compilers, architecture, and synthesis for embedded systems
September 2004
324 pages
ISBN:1581138903
DOI:10.1145/1023833
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. code size
  2. embedded DSP
  3. loop fusion
  4. retiming
  5. scheduling

Qualifiers

  • Article

Conference

CASES04

Acceptance Rates

Overall Acceptance Rate 52 of 230 submissions, 23%

Cited By

  • (2018) Improving efficency of mathematical functions in image processing by loop fusion. 2018 5th International Conference on Electrical and Electronic Engineering (ICEEE), pages 334-339. DOI: 10.1109/ICEEE2.2018.8391357. Online publication date: May 2018.
  • (2011) Loop Distribution and Fusion with Timing and Code Size Optimization. Journal of Signal Processing Systems, 62(3), pages 325-340. DOI: 10.1007/s11265-010-0465-x. Online publication date: 1 Mar 2011.
  • (2010) Energy-Aware Loop Parallelism Maximization for Multi-core DSP Architectures. Proceedings of the 2010 IEEE/ACM Int'l Conference on Green Computing and Communications & Int'l Conference on Cyber, Physical and Social Computing, pages 205-212. DOI: 10.1109/GreenCom-CPSCom.2010.87. Online publication date: 18 Dec 2010.
  • (2009) Optimal loop parallelization for maximizing iteration-level parallelism. Proceedings of the 2009 international conference on Compilers, architecture, and synthesis for embedded systems, pages 67-76. DOI: 10.1145/1629395.1629407. Online publication date: 11 Oct 2009.
  • (2008) Energy minimization with loop fusion and multi-functional-unit scheduling for multidimensional DSP. Journal of Parallel and Distributed Computing, 68(4), pages 443-455. DOI: 10.1016/j.jpdc.2007.06.014. Online publication date: 1 Apr 2008.
  • (2005) Maximum Loop Distribution and Fusion for Two-level Loops Considering Code Size. Proceedings of the 8th International Symposium on Parallel Architectures, Algorithms and Networks, pages 126-131. DOI: 10.1109/ISPAN.2005.58. Online publication date: 7 Dec 2005.
  • (2005) Parallel embedded systems: optimizations and challenges. Conference, Emerging Information Technology 2005, pages 5-8. DOI: 10.1109/EITC.2005.1544328. Online publication date: 2005.
  • (2005) Loop distribution and fusion with timing and code size optimization for embedded DSPs. Proceedings of the 2005 international conference on Embedded and Ubiquitous Computing, pages 121-130. DOI: 10.1007/11596356_15. Online publication date: 6 Dec 2005.
