DOI: 10.1145/514191.514227
Article

Computation regrouping: restructuring programs for temporal data cache locality

Published: 22 June 2002

ABSTRACT

Data access costs contribute significantly to the execution time of applications with complex data structures. As the latency of memory accesses becomes high relative to processor cycle times, application performance is increasingly limited by memory performance. In some situations it may be reasonable to trade increased computation costs for reduced memory costs. The contributions of this paper are three-fold: we provide a detailed analysis of the memory performance of a set of seven memory-intensive benchmarks; we describe Computation Regrouping, a general, source-level approach to improving the overall performance of these applications by improving temporal locality to reduce cache and TLB miss ratios (and thus memory stall times); and we demonstrate significant performance improvements from applying Computation Regrouping to our suite of seven benchmarks. With Computation Regrouping, we observe an average speedup of 1.97, with individual speedups ranging from 1.26 to 3.03. Most of this improvement comes from eliminating memory stall time.
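
The abstract does not give code, so the following is only a toy sketch of the general idea of regrouping work to improve temporal locality, not the authors' implementation. It assumes a hypothetical table split into fixed-size blocks and a one-block "cache", and it counts a load whenever consecutive accesses touch different blocks. The names `BLOCK_SIZE`, `count_block_loads`, `lookup_in_order`, and `lookup_regrouped` are invented for illustration.

```python
from collections import defaultdict

BLOCK_SIZE = 4  # hypothetical number of table entries per "cache block"


def count_block_loads(indices):
    """Count loads under a toy one-block cache: a load occurs whenever
    the accessed block differs from the previously accessed one."""
    loads = 0
    current = None
    for i in indices:
        block = i // BLOCK_SIZE
        if block != current:
            loads += 1
            current = block
    return loads


def lookup_in_order(table, queries):
    """Baseline: serve each query in arrival order."""
    return [table[q] for q in queries], count_block_loads(queries)


def lookup_regrouped(table, queries):
    """Regrouped: bucket queries by block, then serve one block at a time.
    Results are written back to their original positions."""
    buckets = defaultdict(list)
    for pos, q in enumerate(queries):
        buckets[q // BLOCK_SIZE].append((pos, q))

    results = [None] * len(queries)
    access_order = []
    for block in sorted(buckets):
        for pos, q in buckets[block]:
            results[pos] = table[q]
            access_order.append(q)
    return results, count_block_loads(access_order)


table = [x * x for x in range(16)]
queries = [0, 13, 1, 12, 5, 2, 14, 4]

base_results, base_loads = lookup_in_order(table, queries)
regrouped_results, regrouped_loads = lookup_regrouped(table, queries)
```

Both versions return the same answers; the regrouped version pays for an extra bucketing pass but touches each block in one contiguous run. This mirrors the trade of increased computation for reduced memory cost described above, under the simplifying assumptions stated.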

        • Published in

          ICS '02: Proceedings of the 16th international conference on Supercomputing
          June 2002
          338 pages
          ISBN: 1581134835
          DOI: 10.1145/514191

          Copyright © 2002 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Acceptance Rates

          ICS '02 Paper Acceptance Rate: 31 of 144 submissions, 22%
          Overall Acceptance Rate: 584 of 2,055 submissions, 28%
