ABSTRACT
Data access costs contribute significantly to the execution time of applications with complex data structures. As the latency of memory accesses becomes high relative to processor cycle times, application performance is increasingly limited by memory performance. In some situations it may be reasonable to trade increased computation costs for reduced memory costs. The contributions of this paper are threefold: we provide a detailed analysis of the memory performance of a set of seven memory-intensive benchmarks; we describe Computation Regrouping, a general, source-level approach to improving the overall performance of these applications by improving temporal locality to reduce cache and TLB miss ratios (and thus memory stall times); and we demonstrate significant performance improvements from applying Computation Regrouping to our suite of seven benchmarks. With Computation Regrouping, we observe an average speedup of 1.97, with individual speedups ranging from 1.26 to 3.03. Most of this improvement comes from eliminating memory stall time.
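The abstract does not spell out the transformation itself, so the following is only a toy sketch of the general idea of regrouping computations that touch the same data. It is not the paper's implementation. The names `REGION_SIZE`, `region_of`, `lookup_in_order`, and `lookup_regrouped` are hypothetical, and a "region" is assumed to stand in for a cache-sized chunk of a larger structure. The sketch spends extra work bucketing the lookups in exchange for fewer moves between regions.

```python
from collections import defaultdict

REGION_SIZE = 4  # hypothetical: keys per region, a stand-in for a cache-sized chunk


def region_of(key):
    """Map a key to the region of the table that holds it."""
    return key // REGION_SIZE


def count_region_switches(keys):
    """Count how often consecutive accesses land in a different region."""
    switches = 0
    previous = None
    for key in keys:
        current = region_of(key)
        if previous is not None and current != previous:
            switches += 1
        previous = current
    return switches


def lookup_in_order(table, keys):
    """Process lookups in arrival order; accesses jump between regions."""
    return {key: table[key] for key in keys}, count_region_switches(keys)


def lookup_regrouped(table, keys):
    """Defer lookups, bucket them by region, then process one region at a time."""
    groups = defaultdict(list)
    for key in keys:
        groups[region_of(key)].append(key)  # extra computation: the bucketing pass
    ordered = [key for region in sorted(groups) for key in groups[region]]
    return {key: table[key] for key in ordered}, count_region_switches(ordered)


table = [k * k for k in range(16)]
queries = [0, 13, 2, 9, 14, 1, 8, 5]

in_order_results, in_order_switches = lookup_in_order(table, queries)
regrouped_results, regrouped_switches = lookup_regrouped(table, queries)
print(in_order_switches, regrouped_switches)
```

Both versions compute the same results; only the order of the work differs. The switch count here is a crude proxy for locality and not a measurement of cache or TLB misses.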
Index Terms
- Computation regrouping: restructuring programs for temporal data cache locality