ABSTRACT
Time skewing and loop tiling have long been known to be highly beneficial acceleration techniques for nested loops, especially on bandwidth-hungry multi-core processors. They are little used in practice, however, because efficient implementations require complicated code, while simple or abstract ones show much smaller gains over naive nested loops. We break this dilemma with an essential time skewing scheme that is both compact and fast.
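To make the terminology concrete, here is a minimal sketch of textbook time skewing applied to a 1D three-point averaging stencil with two ping-pong buffers. It is an illustration only and not the scheme proposed in the paper; the function names, the averaging kernel, the fixed boundary cells, and the `tile` parameter are all assumptions chosen for the example.

```python
def jacobi_naive(u0, steps):
    """Reference version: sweep the whole array once per time step."""
    n = len(u0)
    bufs = [list(u0), list(u0)]  # both buffers hold the fixed boundary cells
    for t in range(steps):
        src, dst = bufs[t % 2], bufs[(t + 1) % 2]
        for i in range(1, n - 1):
            dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0
    return bufs[steps % 2]


def jacobi_time_skewed(u0, steps, tile):
    """Same computation, tiled along the skewed coordinate s = i + t.

    Each tile advances a narrow band of cells through all time steps
    before moving on, so the band can stay in cache. The band shifts
    one cell to the left per time step, which keeps every read ahead
    of the write that would overwrite it.
    """
    n = len(u0)
    bufs = [list(u0), list(u0)]
    s_max = (n - 2) + (steps - 1)  # largest value of s = i + t
    for s_lo in range(1, s_max + 1, tile):
        s_hi = min(s_lo + tile, s_max + 1)
        for t in range(steps):
            src, dst = bufs[t % 2], bufs[(t + 1) % 2]
            lo = max(1, s_lo - t)        # clip to interior cells
            hi = min(n - 1, s_hi - t)
            for i in range(lo, hi):
                dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0
    return bufs[steps % 2]
```

Both functions perform the same arithmetic per cell, only in a different order, so they should return identical lists for any `tile >= 1`. In this sketch the tiles run sequentially from left to right; a parallel or multi-dimensional variant needs additional care that is not shown here.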
Time skewing made simple
PPoPP '11