DOI: 10.1145/3649153.3649194
research-article
Open access

Register Blocking: An Analytical Modelling Approach for Affine Loop Kernels

Published: 02 July 2024

Abstract

For the past several decades, optimizing compilers have been a primary area of focus in both industry and academia. This continued research interest is a testament to the complexity of the task, which stems primarily from the vast number of parameters that must be explored to attain near-optimal results. One of the key compiler optimizations is "Register Blocking (RB)", also known as "Register-level Tiling" or "unroll-and-jam". RB can strongly reduce the number of executed Load/Store (L/S) instructions and, as a consequence, the number of data accesses in the memory hierarchy, but because of its inherent complexity it must be fine-tuned to be effective. To address this problem, this work proposes a new methodology for RB. An analytical model generates the RB factors, the loops to which RB is applied, the number of variables/registers allocated per array reference, and the loop ordering, leveraging the details of the target hardware (HW) architecture and the characteristics of the loop kernel. The proposed methodology has been evaluated on both embedded and general-purpose CPUs across seven well-known loop kernels, achieving high speedups and L/S instruction gains over the GCC compiler, hand-written optimized code, and the popular Pluto tool.
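To make the claim about Load/Store reduction concrete, here is a minimal sketch of register blocking on a matrix-vector product. It is not the paper's analytical model: the kernel, the fixed blocking factor of 2, and the function names `mvm_baseline` and `mvm_register_blocked` are assumptions chosen for illustration, and Python local variables merely stand in for CPU registers. Each function counts the array reads and writes it performs.

```python
def mvm_baseline(A, x, y):
    """y[i] += A[i][j] * x[j], with y[i] held in a scalar across the j loop.

    Returns (loads, stores), counting one load or store per array access.
    """
    loads = 0
    stores = 0
    n, m = len(A), len(x)
    for i in range(n):
        acc = y[i]                  # one load of y[i]
        loads += 1
        for j in range(m):
            acc += A[i][j] * x[j]   # loads A[i][j] and x[j]
            loads += 2
        y[i] = acc                  # one store of y[i]
        stores += 1
    return loads, stores


def mvm_register_blocked(A, x, y):
    """Same kernel with the i loop unrolled by 2 and jammed into the j loop.

    The blocking factor 2 is an illustrative assumption; the row count
    must be even because no remainder loop is generated here.
    """
    loads = 0
    stores = 0
    n, m = len(A), len(x)
    if n % 2 != 0:
        raise ValueError("this sketch needs an even number of rows")
    for i in range(0, n, 2):
        acc0 = y[i]                 # two accumulators live in "registers"
        acc1 = y[i + 1]
        loads += 2
        for j in range(m):
            xj = x[j]               # x[j] is loaded once and reused twice
            loads += 1
            acc0 += A[i][j] * xj
            acc1 += A[i + 1][j] * xj
            loads += 2
        y[i] = acc0
        y[i + 1] = acc1
        stores += 2
    return loads, stores
```

Under this counting, the baseline performs 2nm + n loads for an n-by-m matrix, while the blocked version performs 1.5nm + n, because each x[j] is fetched once per pair of rows instead of once per row. The number of stores is unchanged. A larger blocking factor would reuse x[j] more but needs more live accumulators, which is the register-pressure trade-off that makes tuning necessary.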



Published In

CF '24: Proceedings of the 21st ACM International Conference on Computing Frontiers
May 2024
345 pages
ISBN: 9798400705977
DOI: 10.1145/3649153
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. CPUs
  2. Compiler Optimizations
  3. Data Reuse
  4. High Performance Computing
  5. Register Blocking
  6. Register Tiling
  7. Unroll-and-Jam

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

CF '24

Acceptance Rates

CF '24 Paper Acceptance Rate 33 of 105 submissions, 31%;
Overall Acceptance Rate 273 of 785 submissions, 35%

