ABSTRACT
Loop tiling, or blocking, improves temporal locality by dividing the problem domain into tiles and then repeatedly accessing the data within a tile. While this improves reuse, it also has an often-ignored side effect: it breaks the streaming data access pattern. As a result, tiled codes are unable to exploit the sophisticated hardware prefetchers in present-day processors to extract extra performance.
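The side effect described above can be seen by listing the addresses a loop nest touches. The following is a minimal sketch, not taken from the paper: the helper names (`untiled_order`, `tiled_order`, `stream_breaks`) and the 8x8 array with 4x4 tiles are illustrative assumptions, and Python is used only to show the iteration order, not to measure performance.

```python
def untiled_order(n_rows, n_cols):
    """Row-major addresses visited by a plain i/j loop nest."""
    return [i * n_cols + j for i in range(n_rows) for j in range(n_cols)]


def tiled_order(n_rows, n_cols, tile_i, tile_j):
    """Row-major addresses visited when the same nest is tiled."""
    order = []
    for ii in range(0, n_rows, tile_i):          # loop over tile rows
        for jj in range(0, n_cols, tile_j):      # loop over tile columns
            for i in range(ii, min(ii + tile_i, n_rows)):
                for j in range(jj, min(jj + tile_j, n_cols)):
                    order.append(i * n_cols + j)
    return order


def stream_breaks(addresses):
    """Count the steps where the next address is not the previous one plus 1."""
    return sum(1 for prev, cur in zip(addresses, addresses[1:]) if cur != prev + 1)


if __name__ == "__main__":
    plain = untiled_order(8, 8)
    tiled = tiled_order(8, 8, 4, 4)
    # Both orders touch exactly the same elements; only the sequence differs.
    print(sorted(tiled) == plain)
    print(stream_breaks(plain), stream_breaks(tiled))
```

The untiled traversal is one unbroken stream, while the tiled traversal jumps to a new address at the end of nearly every tile row. Short, interrupted streams of this kind are what give a stride-based hardware prefetcher little to work with.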
In this work, we propose a tiling algorithm that leverages prefetching to boost the performance of tiled codes. The key idea is to tile for the last-level cache, rather than for the higher levels of cache as is generally recommended. This approach not only exposes streaming access patterns in the tiled code that are amenable to prefetching, but also reduces off-chip traffic to memory (and therefore scales better with the number of cores). As a result, although we tile for the last-level cache, we effectively access the data in the higher levels of cache because the data is prefetched in time for computation. To this end, we propose an algorithm that selects a tile size with the aim of maximizing data reuse and minimizing conflict misses in the shared last-level cache of modern multi-core processors. We find that the combined effect of tiling for the last-level cache and effective hardware prefetching gives a significant improvement over existing tiling algorithms that target the higher-level L1/L2 caches and do not leverage the hardware prefetchers. When run on an Intel 8-core machine using different problem sizes, our approach achieves an average improvement of 27% and 48% for smaller and larger problem sizes, respectively, over the best tile sizes selected by state-of-the-art algorithms.
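To give a feel for how the target cache level changes the tile size, here is a rough capacity-only sketch. It is not the tile size selection algorithm proposed in the paper: the function `square_tile_size`, the three-array working set, the `fill_fraction` headroom, and the cache sizes (32 KiB private L1, 20 MiB last-level cache shared by 8 cores) are all hypothetical, and conflict misses, which the proposed algorithm aims to minimize, are not modeled at all.

```python
import math


def square_tile_size(cache_bytes, sharers, n_arrays=3, elem_bytes=8,
                     fill_fraction=0.5):
    """Largest T such that n_arrays T-by-T tiles fit in one core's cache share.

    Assumption: each core gets an equal share of the cache, and only
    fill_fraction of that share is used, leaving headroom for other data.
    """
    budget = cache_bytes * fill_fraction / sharers       # bytes per core
    max_elems_per_tile = int(budget // (n_arrays * elem_bytes))
    return math.isqrt(max_elems_per_tile)                # T*T <= max_elems


if __name__ == "__main__":
    l1_tile = square_tile_size(32 * 2**10, sharers=1)    # hypothetical L1
    llc_tile = square_tile_size(20 * 2**20, sharers=8)   # hypothetical LLC
    print(l1_tile, llc_tile)
```

Under these assumptions the last-level-cache tile has rows several times longer than the L1 tile, so each row of a tile forms a longer contiguous stream for the prefetcher to follow.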
Index Terms
- TurboTiling: Leveraging Prefetching to Boost Performance of Tiled Codes