Achieving Scalable Locality with Time Skewing

International Journal of Parallel Programming

Abstract

Microprocessor speed has been growing exponentially faster than memory system speed in the recent past. This paper explores the long-term implications of this trend. We define scalable locality, which measures our ability to apply ever-faster processors to increasingly large problems (just as scalable parallelism measures our ability to apply more numerous processors to larger problems). We provide an algorithm called time skewing that derives an execution order and storage mapping to produce any desired degree of locality, for certain programs that can be made to exhibit scalable locality. Our approach is unusual in that it derives the transformation from the algorithm's dataflow (a fundamental characteristic of the algorithm) instead of searching a space of transformations of the execution order and array layout used by the programmer (artifacts of the expression of the algorithm). We provide empirical results for data sets using L2 cache, main memory, and virtual memory.
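The abstract does not spell out the transformation, so the following is only an illustrative sketch of the general skew-and-tile idea on a one-dimensional three-point stencil, not the paper's algorithm. The function names, the averaging kernel, the `tile` parameter, and the two-buffer storage are all assumptions made for this example; in particular, the sketch keeps the usual two-buffer layout rather than deriving a new storage mapping. The skewed coordinate `j = i + t` makes every value an iteration reads come from the same or a smaller `j`, so tiles of `j` can be run one after another while each tile sweeps all time steps.

```python
def stencil_naive(a, steps):
    """Reference order: sweep the whole array once per time step."""
    n = len(a)
    cur, nxt = list(a), list(a)  # both copies hold the fixed boundary values
    for _ in range(steps):
        for i in range(1, n - 1):
            nxt[i] = (cur[i - 1] + cur[i] + cur[i + 1]) / 3.0
        cur, nxt = nxt, cur
    return cur


def stencil_time_skewed(a, steps, tile):
    """Same computation, reordered by skewing space against time and tiling.

    Iteration (t, i) is relabelled with j = i + t. Tiles of `tile`
    consecutive j values run in increasing order; inside a tile, time
    steps run in increasing order. (Hypothetical sketch, see lead-in.)
    """
    n = len(a)
    buf = [list(a), list(a)]  # buf[t % 2] holds the values at time t
    j_max = (n - 2) + (steps - 1)  # largest skewed coordinate in use
    for j0 in range(1, j_max + 1, tile):
        for t in range(steps):
            src, dst = buf[t % 2], buf[(t + 1) % 2]
            lo = max(j0, 1 + t)                 # clip tile to interior points
            hi = min(j0 + tile - 1, n - 2 + t)
            for j in range(lo, hi + 1):
                i = j - t                       # undo the skew
                dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0
    return buf[steps % 2]


if __name__ == "__main__":
    data = [float((5 * k) % 7) for k in range(16)]
    print(stencil_time_skewed(data, steps=4, tile=3) == stencil_naive(data, steps=4))
```

With a small `tile`, each tile touches only about `tile + steps` neighbouring array elements across all of its time steps, which is the kind of reuse that lets the working set fit in a faster level of the memory hierarchy.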




Cite this article

Wonnacott, D. Achieving Scalable Locality with Time Skewing. International Journal of Parallel Programming 30, 181–221 (2002). https://doi.org/10.1023/A:1015460304860


  • DOI: https://doi.org/10.1023/A:1015460304860
