Loop Distribution and Fusion with Timing and Code Size Optimization

  • Published in: Journal of Signal Processing Systems

Abstract

This paper proposes a technique that combines loop distribution with maximum direct loop fusion (LD_MDF). The technique first performs maximum loop distribution and then maximum direct loop fusion, optimizing timing and code size simultaneously. Loop distribution theorems are proved that state the conditions under which any multi-level nested loop can be maximally distributed. In particular, it is proved that the statements involved in a dependence cycle can be fully distributed if the sum of the edge weights of the dependence cycle satisfies a certain condition; otherwise, those statements must be placed in the same loop after distribution. Based on these theorems, algorithms are designed to perform maximum loop distribution. The maximum direct loop fusion problem is mapped to the graph partitioning problem, and a polynomial-time graph partitioning algorithm is developed to compute the fusion partitions. It is proved that the proposed maximum direct loop fusion algorithm produces the fewest resultant loop nests without violating dependence constraints. It is also shown that the code size of the loops fused by LD_MDF is smaller than that of the original loops whenever the number of fused loops is less than the number of the original loops. Simulation results are presented to validate the proposed technique.
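To make the two transformations concrete, the following is a minimal illustrative sketch (not the paper's LD_MDF algorithm): loop distribution splits a loop body into separate loops, one per statement group, and direct loop fusion merges loops with identical headers back into one. Both rewrites preserve semantics here because the two statements carry no fusion-preventing dependence between them; the loop bounds and array names are invented for the example.

```python
N = 8

def original(a, b, c):
    # One loop containing two statements.
    for i in range(1, N):
        a[i] = a[i - 1] + 1   # S1: loop-carried dependence on itself only
        b[i] = c[i] * 2       # S2: independent of S1

def distributed(a, b, c):
    # Maximum loop distribution: S1 and S2 go into separate loops,
    # legal because S1 and S2 form no dependence cycle.
    for i in range(1, N):
        a[i] = a[i - 1] + 1
    for i in range(1, N):
        b[i] = c[i] * 2

def fused(a, b, c):
    # Direct fusion of the two distributed loops: identical headers,
    # and merging introduces no backward dependence.
    for i in range(1, N):
        a[i] = a[i - 1] + 1
        b[i] = c[i] * 2

def run(f):
    a, b, c = [0] * N, [0] * N, list(range(N))
    f(a, b, c)
    return a, b

# All three versions compute the same result.
assert run(original) == run(distributed) == run(fused)
```

Distribution enables scheduling each statement's loop independently; fusing loops with identical headers afterwards recovers a single loop body, which is why the combined transformation can reduce code size relative to keeping the distributed loops separate.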





Author information


Corresponding author

Correspondence to Meilin Liu.

Additional information

This work is partially supported by NSF CCR-0309461, NSF IIS-0513669, WSU-666781, WSU-282025.


About this article

Cite this article

Liu, M., Sha, E.H.M., Zhuge, Q. et al. Loop Distribution and Fusion with Timing and Code Size Optimization. J Sign Process Syst 62, 325–340 (2011). https://doi.org/10.1007/s11265-010-0465-x

