Cross-loop reuse analysis and its application to cache optimizations

  • Automatic Data Distribution and Locality Enhancement
  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1239)

Abstract

In this paper we describe the design of a data-flow framework for detecting cross-loop reuse. Cross-loop reuse takes place when a set of data items or cache lines is accessed in a given loop nest and then accessed again within some subsequent portion of the program, usually another outer loop nest. In contrast to intra-loop reuse, which occurs during the execution of a single loop nest, cross-loop reuse is hard to analyze using traditional dependence-based techniques. The framework we have constructed is based on a combination of array section analysis (to capture array access patterns at a high level) and data-flow analysis (to deal with intra-procedural control flow). The framework is designed to account for cache size when gathering reuse information, and when used in an interprocedural setting, the framework also provides a mechanism for summarizing the effects of procedure calls.
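To make the idea concrete, the following is a minimal sketch and not the paper's framework. It assumes each loop nest's accesses are summarized as one contiguous index range of a single array, and that cache capacity is counted in array elements. The names `Section`, `overlap`, and `surviving_reuse` are hypothetical, and the capacity model (every element touched between the two loops displaces one reusable element) is a deliberate simplification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Section:
    """Simplified array section: indices lo..hi (inclusive) of one array."""
    array: str
    lo: int
    hi: int

    def size(self) -> int:
        return max(0, self.hi - self.lo + 1)


def overlap(a: Section, b: Section) -> Optional[Section]:
    """Return the part of the array touched by both sections, if any."""
    if a.array != b.array:
        return None
    lo, hi = max(a.lo, b.lo), min(a.hi, b.hi)
    return Section(a.array, lo, hi) if lo <= hi else None


def surviving_reuse(first: Section, second: Section,
                    intervening_elems: int, cache_elems: int) -> int:
    """Estimate how many shared elements may still be cached at the second loop.

    Assumption: data touched between the two loops evicts reusable
    data one-for-one, so only the leftover capacity can hold reuse.
    """
    common = overlap(first, second)
    if common is None:
        return 0
    room = max(0, cache_elems - intervening_elems)
    return min(common.size(), room)


# Hypothetical loop nests: the first reads A[0..999], the second A[500..1499].
loop1 = Section("A", 0, 999)
loop2 = Section("A", 500, 1499)
common = overlap(loop1, loop2)

near = surviving_reuse(loop1, loop2, intervening_elems=0, cache_elems=4096)
far = surviving_reuse(loop1, loop2, intervening_elems=4000, cache_elems=4096)
gone = surviving_reuse(loop1, loop2, intervening_elems=5000, cache_elems=4096)
print(common, near, far, gone)
```

In this toy model the same pair of loops yields full, partial, or no exploitable reuse depending on how much other data is touched in between, which is why a reuse analysis has to take cache size into account.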

Cross-loop reuse information can be used to drive a number of transformations that enhance locality and improve cache utilization, including loop fusion and loop reversal. Although these transformations are not new, their impact on cache behavior has not always been possible to predict, making them difficult to apply. As part of this paper we report the results of a comprehensive experimental study in which we apply our techniques to a set of ten programs from the SPEC95 floating point benchmark suite. We were able to obtain modest performance gains overall for several of the programs, based mostly on improvements in cache utilization.
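As an illustration of why such transformations can help, the sketch below counts misses in a small cache simulation. It is an assumption-laden toy and not the experimental setup of the paper: the cache is fully associative with LRU replacement, each loop touches the same `N` cache lines once, and `N`, `C`, and `count_misses` are hypothetical. It also ignores whether reversal or fusion is legal for a given pair of loops, which depends on their data dependences.

```python
from collections import OrderedDict


def count_misses(trace, capacity):
    """Count misses for a fully associative LRU cache of `capacity` lines."""
    cache = OrderedDict()
    misses = 0
    for line in trace:
        if line in cache:
            cache.move_to_end(line)        # hit: mark as most recently used
        else:
            misses += 1
            cache[line] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used line
    return misses


N = 1000  # cache lines touched by each loop (hypothetical)
C = 256   # cache capacity in lines (hypothetical)

first_loop = list(range(N))

# Original: both loops sweep the data in the same direction.
original = first_loop + list(range(N))

# Loop reversal: the second loop starts with the lines touched last.
reversed_second = first_loop + list(range(N - 1, -1, -1))

# Loop fusion: both loop bodies touch line i in the same iteration.
fused = [i for i in range(N) for _ in range(2)]

misses_original = count_misses(original, C)
misses_reversed = count_misses(reversed_second, C)
misses_fused = count_misses(fused, C)
print(misses_original, misses_reversed, misses_fused)
```

Under these assumptions the original ordering gets no benefit from the cross-loop reuse because the data set exceeds the cache, reversal recovers the `C` lines still resident when the second loop starts, and fusion converts all of the cross-loop reuse into reuse within a single iteration.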

This work was supported in part by ARPA (Army Contract DABT63-95-C-0115).

Editor information

David Sehr Utpal Banerjee David Gelernter Alex Nicolau David Padua

Copyright information

© 1997 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Cooper, K., Kennedy, K., McIntosh, N. (1997). Cross-loop reuse analysis and its application to cache optimizations. In: Sehr, D., Banerjee, U., Gelernter, D., Nicolau, A., Padua, D. (eds) Languages and Compilers for Parallel Computing. LCPC 1996. Lecture Notes in Computer Science, vol 1239. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0017242

  • DOI: https://doi.org/10.1007/BFb0017242

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-63091-3

  • Online ISBN: 978-3-540-69128-0

  • eBook Packages: Springer Book Archive
