research-article

TurboTiling: Leveraging Prefetching to Boost Performance of Tiled Codes

Authors:
Sanyam Mehta, Department of Computer Science and Engineering, University of Minnesota, MN USA
Rajat Garg, Department of Computer Science and Engineering, University of Minnesota, MN USA
Nishad Trivedi, Department of Computer Science and Engineering, University of Minnesota, MN USA
Pen-Chung Yew, Department of Computer Science and Engineering, University of Minnesota, MN USA

ICS '16: Proceedings of the 2016 International Conference on Supercomputing, June 2016, Article No. 38, Pages 1–12
https://doi.org/10.1145/2925426.2926288

Published: 01 June 2016

ABSTRACT

Loop tiling or blocking improves temporal locality by dividing the problem domain into tiles and then repeatedly accessing the data within a tile. While this improves reuse, it also has an often-ignored side effect: it breaks the streaming data access pattern. As a result, tiled codes are unable to exploit the sophisticated hardware prefetchers in present-day processors to extract extra performance.
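The side effect described above can be made concrete with a small sketch. It is not taken from the paper: the function names (`row_major_addresses`, `tiled_addresses`, `stream_breaks`) and the array and tile sizes are hypothetical, chosen only for illustration. The code lists the linear offsets touched by an untiled and a tiled sweep of a row-major array and counts how often the next access is not the next element in memory, which is a rough proxy for how often a streaming prefetcher would lose its pattern.

```python
def row_major_addresses(n):
    """Linear offsets touched by an untiled, row-by-row sweep of an n x n array."""
    return [i * n + j for i in range(n) for j in range(n)]


def tiled_addresses(n, tile):
    """Linear offsets touched when the same array is swept tile by tile."""
    order = []
    for ii in range(0, n, tile):          # row origin of the current tile
        for jj in range(0, n, tile):      # column origin of the current tile
            for i in range(ii, min(ii + tile, n)):
                for j in range(jj, min(jj + tile, n)):
                    order.append(i * n + j)
    return order


def stream_breaks(addresses):
    """Count consecutive accesses that do not move to the next element in memory."""
    return sum(1 for a, b in zip(addresses, addresses[1:]) if b != a + 1)


if __name__ == "__main__":
    n, tile = 8, 4  # illustrative sizes only
    print("untiled breaks:", stream_breaks(row_major_addresses(n)))
    print("tiled breaks:", stream_breaks(tiled_addresses(n, tile)))
```

Both sweeps touch exactly the same elements; only the order differs. The smaller the tile relative to the row length, the shorter each contiguous run becomes, which is the effect the abstract attributes to tiling for small caches.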

In this work, we propose a tiling algorithm that leverages prefetching to boost the performance of tiled codes. We propose to tile for the last-level cache rather than for the higher-level (L1/L2) caches, as is generally recommended. This approach not only exposes streaming access patterns in the tiled code that are amenable to prefetching, but also reduces off-chip traffic to memory (and therefore scales better with the number of cores). As a result, although we tile for the last-level cache, the data is effectively accessed from the higher levels of cache because it is prefetched in time for computation. To this end, we propose an algorithm that selects a tile size aiming to maximize data reuse and minimize conflict misses in the shared last-level cache of modern multi-core processors. We find that the combined effect of tiling for the last-level cache and effective hardware prefetching gives a significant improvement over existing tiling algorithms that target the higher-level L1/L2 caches and do not leverage the hardware prefetchers. When run on an Intel 8-core machine with different problem sizes, our approach achieves average improvements of 27% and 48% for smaller and larger problem sizes, respectively, over the best tile sizes selected by state-of-the-art algorithms.
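To give a feel for how much larger a tile becomes when it targets the last-level cache, here is a back-of-the-envelope sketch. It is not the tile size selection algorithm proposed in the paper: it is a simple capacity bound that ignores conflict misses entirely, and the function name `square_tile_edge`, the 50% fill factor, the three-array working set, and the cache sizes (32 KiB L1, 20 MiB shared last-level cache) are all assumptions made for illustration.

```python
import math


def square_tile_edge(cache_bytes, arrays=3, elem_bytes=8, fill=0.5, sharers=1):
    """Largest T such that `arrays` tiles of T x T elements fit in the cache budget.

    Assumption: each core may use `fill` of the cache, split evenly among
    `sharers` cores. Conflict misses are not modelled.
    """
    budget = int(cache_bytes * fill) // sharers      # bytes available per core
    elems_per_tile = budget // (arrays * elem_bytes)  # elements in one T x T tile
    return math.isqrt(elems_per_tile)                 # floor of the square root


if __name__ == "__main__":
    l1 = 32 * 1024          # hypothetical private L1 data cache
    llc = 20 * 1024 * 1024  # hypothetical shared last-level cache
    print("L1 tile edge:", square_tile_edge(l1))
    print("LLC tile edge, one core:", square_tile_edge(llc))
    print("LLC tile edge, eight cores sharing:", square_tile_edge(llc, sharers=8))
```

Under these assumptions the last-level-cache tile edge is roughly an order of magnitude longer than the L1 tile edge, even when eight cores share the cache, so each row of a tile is a much longer contiguous run for a hardware prefetcher to follow.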

Index Terms

Software and its engineering → Software notations and tools → Compilers


Published in

ICS '16: Proceedings of the 2016 International Conference on Supercomputing
June 2016
547 pages
ISBN: 9781450343619
DOI: 10.1145/2925426

Copyright © 2016 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Loop tiling
Multi-core
Prefetching
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
Overall Acceptance Rate: 584 of 2,055 submissions, 28%