skip to main content
10.1145/2925426.2926288acmconferencesArticle/Chapter ViewAbstractPublication PagesicsConference Proceedingsconference-collections
research-article

TurboTiling: Leveraging Prefetching to Boost Performance of Tiled Codes

Authors Info & Claims
Published:01 June 2016Publication History

ABSTRACT

Loop tiling or blocking improves temporal locality by dividing the problem domain into tiles and then repeatedly accessing the data within a tile. While this reduces reuse, it also leads to an often ignored side-effect: breaking the streaming data access pattern. As a result, tiled codes are unable to exploit the sophisticated hardware prefetchers in present-day processors to extract extra performance.

In this work, we propose a tiling algorithm to leverage prefetching to boost the performance of tiled codes. To achieve this, we propose to tile for the last-level cache as opposed to tiling for higher levels of cache as generally recommended. This approach not only exposes streaming access patterns in the tiled code that are amenable for prefetching, but also allows for a reduction in the off-chip traffic to memory (and therefore, better scaling with the number of cores). As a result, although we tile for the last level cache, we effectively access the data in the higher levels of cache because the data is prefetched in time for computation. To achieve this, we propose an algorithm to select a tile size that aims to maximize data reuse and minimize conflict misses in the shared last-level cache in modern multi-core processors. We find that the combined effect of tiling for the last-level cache and effective hardware prefetching gives significant improvement over existing tiling algorithms that target higher level L1/L2 caches and do not leverage the hardware prefetchers. When run on an Intel 8-core machine using different problem sizes, it achieves an average improvement of 27% and 48% for smaller and larger problem sizes, respectively, over the best tile sizes selected by state-of-the-art algorithms.

References

  1. E. Athanasaki, N. Koziris, and P. Tsanakas. A tile size selection analysis for blocked array layouts. In INTERACT-2005. 9th Annual Workshop, pages 70--80. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. A.-H. A. Badawy, A. Aggarwal, D. Yeung, and C.-W. Tseng. Evaluating the impact of memory system performance on software prefetching and locality optimizations. In ICS '01, pages 486--500. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. V. Bandishti, I. Pananilath, and U. Bondhugula. Tiling stencil computations to maximize parallelism. In SC '12, pages 1--11, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. B. Bao and C. Ding. Defensive loop tiling for shared cache. In CGO '13, pages 1--11. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. C. Bastoul. Code generation in the polyhedral model is easier than you think. In PACT '04, pages 7--16. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. C. Bastoul. Code generation in the polyhedral model is easier than you think. In PACT '13, pages 7--16, Juan-les-Pins, France, September 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. J. Bilmes, K. Asanovic, C.-W. Chin, and J. Demmel. Optimizing matrix multiply using phipac: A portable, high-performance, ansi c coding methodology. In ICS '97, pages 340--347. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. U. Bondhugula, M. Baskaran, S. Krishnamoorthy, J. Ramanujam, A. Rountev, and P. Sadayappan. In L. Hendren, editor, In CC'08, volume 4959 of Lecture Notes in Computer Science, pages 132--146. 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. J. Chame and S. Moon. A tile selection algorithm for data locality and cache interference. In ICS '99, pages 492--499. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. C. Chen, J. Chame, and M. Hall. Chill: A framework for composing high-level loop transformations. U. of Southern California, Tech. Rep, pages 08--897, 2008.Google ScholarGoogle Scholar
  11. S. Coleman and K. S. McKinley. Tile size selection using cache organization and data layout. In PLDI'95, 30(6):279--290. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. K. Cooper and J. Sandoval. Portable Techniques to Find Effective Memory Hierarchy Parameters. Technical report, 2011.Google ScholarGoogle Scholar
  13. C. ŢĂpuş, I.-H. Chung, and J. K. Hollingsworth. Active harmony: Towards automated performance tuning. In SC '02, pages 1--11. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Y. Ding, J. Ansel, K. Veeramachaneni, X. Shen, U. O'Reilly, and S. P. Amarasinghe. Autotuning algorithmic choice for input sensitivity. In In PLDI'15, pages 379--390. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. J. J. Dongarra, J. Du Croz, S. Hammarling, and I. S. Duff. A set of level 3 basic linear algebra subprograms. In TOMS'90, 16(1):1--17. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Z. Fang, S. Mehta, P.-C. Yew, A. Zhai, J. Greensky, G. Beeraka, and B. Zang. Measuring microarchitectural details of multi- and many-core memory systems through microbench marking. ACMTrans. Archit. Code Optim., 11(4):55:1--55:26, Jan. 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. M. Frigo. A fast fourier transform compiler. In PLDI '99, pages 169--180. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. S. Ghosh, M. Martonosi, and S. Malik. Cache miss equations: An analytical representation of cache misses. In ICS '97, pages 317--324. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. J. Holewinski, L.-N. Pouchet, and P. Sadayappan. High-performance code generation for stencil computations on gpu architectures. In ICS '12, pages 311--320, New York, NY, USA, 2012. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. D. Kim, S. S.-w. Liao, P. H. Wang, J. d. Cuvillo, X. Tian, X. Zou, H. Wang, D. Yeung, M. Girkar, and J. P. Shen. Physical experimentation with prefetching helper threads on intel's hyper-threaded processors. In CGO '04. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. J. Lee, H. Kim, and R. Vuduc. When prefetching works, when it doesn't, and why. In TACO'12, 9(1):2:1--2:29. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. A. W. Lim, S.-W. Liao, and M. S. Lam. Blocking and array contraction across arbitrarily nested loops using affine partitioning. In PPoPP '01, pages 103--112. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. S. Mehta, G. Beeraka, and P.-C. Yew. Tile size selection revisited. In TACO'13, 10(4):35:1--35:27. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. S. Mehta, Z. Fang, A. Zhai, and P.-C. Yew. Multi-stage coordinated prefetching for present-day processors. In ICS '14, pages 73--82. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. S. Moon and R. H. Saavedra. Hyperblocking: A data reorganization method to eliminate cache conflicts in tiled loop nests. Technical report, Conflicts in Tiled Loop Nests, USC-CS-98-671, USC Computer Science, 1998.Google ScholarGoogle Scholar
  26. T. C. Mowry,M. S. Lam, and A. Gupta. Design and evaluation of a compiler algorithm for prefetching. In ASPLOS-V'92, pages 62--73. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. L.-N. Pouchet. Polybench Benchmark Suite. Available at http://www\-roc.inria.fr/~pouchet/software/polybench/.Google ScholarGoogle Scholar
  28. A. Qasem, K. Kennedy, and J. M. Mellor-Crummey. Automatic tuning of whole applications using direct search and a performance-based transformation system. In SC'06, 36(2):183--196. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. M. Rahman, L.-N. Pouchet, and P. Sadayappan. Neural network assisted tile size selection. In IWAPT '2010.Google ScholarGoogle Scholar
  30. J. Reinders. VTune performance analyzer essentials.Google ScholarGoogle Scholar
  31. G. Rivera and C.-W. Tseng. A comparison of compiler tiling algorithms. In CC '99, pages 168--182. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. G. Rivera and C.-W. Tseng. Tiling optimizations for 3d scientific computations. In SC'00. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. R. Saavedra, W. Mao, D. Park, J. Chame, and S. Moon. The combined effectiveness of unimodular transformations, tiling, and software prefetching. In IPPS '96, pages 39--45. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. J. Shirako, K. Sharma, N. Fauzia, L.-N. Pouchet, J. Ramanujam, P. Sadayappan, and V. Sarkar. Analytical bounds for optimal tile size selection. In CC'12, pages 101--121. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. R. Strzodka, M. Shaheen, D. Pajak, and H.-P. Seidel. Cache accurate time skewing in iterative stencil computations. In ICPP '11, pages 571--581, Sept 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Y. Tang, R. A. Chowdhury, B. C. Kuszmaul, C.-K. Luk, and C. E. Leiserson. The pochoir stencil compiler. In SPAA '11, pages 117--128, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. A. Tiwari, C. Chen, J. Chame, M. Hall, and J. K. Hollingsworth. A scalable auto-tuning framework for compiler optimization. In IPDPS '09, pages 1--12. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. F. G. Van Zee and R. A. van de Geijn. Blis: A framework for rapidly instantiating blas functionality. TOMS'15, 41(3):14:1--14:33. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimizations of software and the atlas project. In Parallel Computing, 27:3--35.Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. M. Wolfe. More iteration space tiling. In SC '89, pages 655--664. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Q. Yi and J. Guo. Extensive parameterization and tuning of architecture-sensitive optimizations. In ICCS'11, pages 2156--2165.Google ScholarGoogle Scholar
  42. Q. Yi, K. Seymour, H. You, R. W. Vuduc, and D. J. Quinlan. POET: parameterized optimizations for empirical tuning. In IPDPS'07, pages 1--8.Google ScholarGoogle Scholar

Index Terms

  1. TurboTiling: Leveraging Prefetching to Boost Performance of Tiled Codes

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      ICS '16: Proceedings of the 2016 International Conference on Supercomputing
      June 2016
      547 pages
      ISBN:9781450343619
      DOI:10.1145/2925426

      Copyright © 2016 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 1 June 2016

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article
      • Research
      • Refereed limited

      Acceptance Rates

      Overall Acceptance Rate584of2,055submissions,28%

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader