An efficient tile size selection model based on machine learning

https://doi.org/10.1016/j.jpdc.2018.06.005

Highlights

  • We revisit the tile size selection problem of loop tiling using machine learning.

  • Extracted features can capture the effect of data locality and vectorization.

  • We build a tile size prediction model with a generalized regression neural network.

  • Predicted tile sizes can be adapted to different numbers of threads for parallel load balance.

  • Results show near-optimal performance over different benchmarks on 2 platforms.

Abstract

Tiling is a classic loop optimization to improve data locality and achieve coarse-grained parallelism. Tile size selection (TSS) plays an important role in tiling because it determines the performance of the tiled codes. Most previous TSS approaches require highly skilled manual effort, yet it is still difficult to find the optimal tile sizes. In this article, we propose an efficient TSS model using a machine learning technique to predict optimal rectangular tile sizes for a given program on multi-core processors. A set of loop features is extracted from tiled codes to capture the locality of data references and the effect of vectorization in the tiled loop dimensions. Using the features and corresponding best tile sizes, a generalized regression neural network is employed to build the TSS model, hiding the complicated interactions between tile sizes and underlying factors. Although the impact of multithreading is not directly considered in training the model, the predicted tile sizes can be well adapted to different numbers of threads. Experimental results show that the predicted tile sizes achieve 90% and 81% of the optimal performance on average for 20 selected benchmarks on an Intel Xeon and an IBM Power6 multi-core platform, respectively. The optimal performance is delivered by the tile sizes obtained through a heuristically exhaustive search. Our TSS model outperforms an artificial neural network (ANN)-based TSS prediction model, which depends on prefetched features, by over 9% in average performance for 9 benchmarks. It also outperforms a state-of-the-art analytical TSS model, which uses cache set associativity and the interaction with single instruction multiple data (SIMD) units to estimate the optimal tile sizes, by over 7% in average performance for 7 benchmarks.

Introduction

Nested loops are generally the hot spots in many important scientific computing kernels; they take up most of the execution time and can easily lead to frequent cache misses. Tiling [[19], [29], [32], [46], [47], [48]] is a classic loop transformation widely used in program optimization to enhance data locality in the higher levels of the memory hierarchy and to exploit coarse-grained parallelism. Loop tiling reorders iterative computations by traversing the iteration space according to the tile sizes, which shortens data reuse distances and thereby reduces cache misses. The choice of tile sizes leads to significant variation in the performance of the tiled codes, so tile size selection (TSS) plays an important role in the effective use of loop tiling. However, selecting optimal tile sizes has become ever more challenging, since the factors through which the on-chip memory hierarchy and the program's running environment influence the best tile sizes are increasingly complicated.
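To make the transformation concrete, the following sketch (ours, not reproduced from the paper) tiles a matrix multiplication loop nest with rectangular tile sizes Ti, Tj, and Tk; the outer tile loops visit the iteration space one block at a time so that each tile's working set can stay resident in cache:

```c
/* Illustrative sketch: rectangular loop tiling of matrix multiplication.
 * Ti, Tj, Tk are the tile sizes whose selection the paper studies; the
 * values below are placeholders, not the paper's predictions. */
#define N   2048
#define Ti  32
#define Tj  64
#define Tk  128

#define MIN(a, b) ((a) < (b) ? (a) : (b))

void matmul_tiled(double A[N][N], double B[N][N], double C[N][N]) {
    /* Tile loops: traverse the iteration space one Ti x Tj x Tk block at a time. */
    for (int ii = 0; ii < N; ii += Ti)
        for (int jj = 0; jj < N; jj += Tj)
            for (int kk = 0; kk < N; kk += Tk)
                /* Point loops: iterate inside one tile, whose working set
                 * is small enough to remain in cache across reuses. */
                for (int i = ii; i < MIN(ii + Ti, N); i++)
                    for (int k = kk; k < MIN(kk + Tk, N); k++)
                        for (int j = jj; j < MIN(jj + Tj, N); j++)
                            C[i][j] += A[i][k] * B[k][j];
}
```

Keeping the stride-1 loop over j innermost is what makes the innermost tile size interact with vectorization, one of the two effects the paper's loop features are designed to capture.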

Previous work on TSS mainly falls into three categories: analytical approaches [[6], [8], [22], [27], [50]], empirical search [[7], [12], [30], [45]], and machine learning based approaches [[24], [31], [40], [44], [52]]. In analytical approaches, the TSS models are constructed from static analysis of the loop code and critical processor architecture parameters, calculating the best tile sizes for a given combination of program, architecture, and compiler. However, these approaches have proved less effective in practice because it is difficult to capture the complex interactions between source program characteristics and execution environments. Hence, the performance of tiled codes with tile sizes selected by analytical models lags behind that yielded by the actual best tile sizes. Besides, manually creating an accurate analytical TSS model is not easy, and the model needs to be rebuilt when adapting to different processor architectures or loop structures. Keeping a TSS model up to date with the progress of processor architectures and compilers likewise involves considerable human effort.

In empirical search approaches, the tiled code is repeatedly generated and executed over a huge search space of different tile sizes to pick the optimal ones on the target machine. One of the most crucial issues faced by empirical tuning is the enormous search space to be explored when considering multi-dimensional or rectangular tile sizes, i.e., different tile sizes in different loop dimensions. This search consumes so much time that many empirical approaches only adopt cubic tiles, i.e., equal tile sizes along all loop dimensions, to reduce the search space. But cubic tiles have been shown to be suboptimal in general, resulting in unsatisfactory performance [[15], [26]]. Empirical approaches are therefore usually combined with analytical models to perform a heuristic search [[5], [41]], where the TSS model is used to prune the search space and reduce the time cost. However, these approaches have not been widely applied due to time cost and accuracy issues.

Machine learning techniques have been applied to the TSS problem in recent years. In these approaches [[31], [40], [52]], program features characterizing the crucial interactions between performance and tile sizes are extracted to build an optimal tile size prediction model with machine learning techniques such as artificial neural networks and classifiers. Such approaches can effectively hide the complicated influence of processor architectures and intertwined compiler optimization phases on TSS, because training data collected in the real architecture-compiler environment expresses the inherent connections between tile sizes and these factors. When the hardware platform or compiler changes, the training data can be re-collected so that the new dataset reflects the changed running environment. Moreover, data collection can be accomplished without much manual effort. Hence, the extraction of program features has become a key step in creating an accurate TSS prediction model, but finding effective features that capture the essential connections between the performance of tiled code and the corresponding tile sizes remains an open question.

This article proposes a new TSS model based on a machine learning technique to predict optimal rectangular tile sizes for loop codes on modern multi-core processors. The primary loop features are extracted from the tiled codes in light of the data locality in multiple loop dimensions and the vectorization in the innermost loop dimension; they leverage the locality of data references to fit the working set sizes of tiles into the multilevel caches and to capture the exact effect of tile sizes on the performance of tiled codes. A generalized regression neural network (GRNN) is employed to build the TSS model, using artificially synthesized programs to generate a large number of loop features and corresponding best tile sizes as the training dataset. Although only 4 threads are used to train the model, the predicted tile sizes can be adapted to different numbers of threads. To evaluate the proposed TSS model, 20 typical kernels covering 3D/2D loops and 2D/1D data with 3 different problem sizes were chosen for a series of experiments on an Intel Xeon multi-core platform and an IBM Power6 multi-core platform. The predicted tile sizes achieved stable near-optimal performance on both platforms. The results also indicated that the proposed model outperformed an artificial neural network (ANN)-based model and a state-of-the-art analytical model, and our approach yielded good performance under various numbers of threads. Overall, this article makes the following contributions.

  • The loop features extracted from the tiled codes effectively capture the locality of data references in all tiled loop dimensions and the effect of vectorization in the innermost loop dimension, enabling prediction of optimal rectangular tile sizes for the target programs. The features are experimentally shown to be necessary and effective for building a high-quality TSS model.

  • The TSS model is built with machine learning, specifically a GRNN (see the sketch following this list), to hide the intricate underlying factors involved in TSS and to provide stable near-optimal performance. The proposed approach can be extended to more or fewer dimensions of loops and data, not being limited to 3D loops with 2D data. In addition, performance can be improved noticeably when the model is combined with a simple local search, and the approach is largely platform and hardware independent for the target programs.

  • A post-processing approach is proposed to adapt the predicted tile sizes to different numbers of threads through a simple adjustment (a minimal sketch follows this list). Although the effect of multithreading is not directly considered and only 4 threads are used to train the TSS model, the approach leverages the crucial impact of parallel load balance among threads to adjust the predicted tile sizes, preserving the good locality the predicted tile sizes achieve while delivering good performance for various numbers of threads.
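As background for the GRNN mentioned above, its prediction rule is the standard one from Specht's formulation (cited in the reference list): given training feature vectors $x_i$ with associated best tile sizes $y_i$, the prediction for a new feature vector $x$ is a Gaussian-weighted average of the training targets,

$$\hat{y}(x) = \frac{\sum_{i=1}^{n} y_i \exp\!\left(-\frac{\|x - x_i\|^2}{2\sigma^2}\right)}{\sum_{i=1}^{n} \exp\!\left(-\frac{\|x - x_i\|^2}{2\sigma^2}\right)},$$

where $\sigma$ is the smoothing parameter of the Gaussian kernel. How the paper encodes tile-size targets as $y_i$ is described in Section 4.

This page's excerpt does not spell out the load-balance adjustment itself, so the following is only a plausible sketch of such post-processing under our reading of the contribution above: shrink the tile size of the parallelized dimension until the tile count divides evenly among the threads, leaving the inner, locality-critical tile sizes untouched. The function name and the shrink-only policy are illustrative assumptions.

```c
/* Hypothetical post-processing sketch: adjust the tile size of the
 * parallelized loop dimension so its tiles split evenly across threads,
 * while the inner tile sizes (which govern cache locality) are kept.
 * The shrink-only policy is an assumption for illustration. */
int balance_outer_tile(int loop_extent, int predicted_tile, int num_threads) {
    int tile = predicted_tile;
    while (tile > 1) {
        int tiles = (loop_extent + tile - 1) / tile;  /* ceiling division */
        if (tiles % num_threads == 0)
            break;  /* each thread now receives the same number of tiles */
        tile--;     /* shrink slightly and retry */
    }
    return tile;
}
```

For example, with a loop extent of 2048, a predicted tile size of 100, and 4 threads, this sketch returns 89, yielding 24 tiles, i.e., 6 tiles per thread.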

The rest of this article is structured as follows. Section 2 introduces the related work and the motivation of our work. Section 3 details the loop features of the TSS model. Section 4 describes the process of building the TSS model with the GRNN and proposes the approach that adapts the predicted tile sizes to different numbers of threads. Section 5 presents the experiments and results. Section 6 concludes this article and points out future work.

Section snippets

Related work

TSS has been extensively studied to exploit data locality and coarse-grained parallelism for loop codes. The approaches used in the study of TSS fall into three kinds: static analytical models, model-driven empirical search, and machine learning techniques. Some existing literature [[27], [33]] has already analyzed and summarized previous TSS models.

The early research focuses on emulating cache access behaviors to estimate the best tile sizes by using program and …

Target loop structures

The proposed approach targets nested loops that can benefit from loop tiling. Such loops usually suffer frequent cache misses because their reuse distances exceed the cache capacity. Tiling can effectively reduce the reuse distance as a function of the tile sizes and thus achieves good locality and parallelism in the tiled codes. Parallelizable loops are mainly divided into two categories, i.e., DOALL and DOACROSS loops [9]. DOALL loops have no loop-carried dependence and the ones that have good …
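The excerpt breaks off here, but the DOALL/DOACROSS distinction it introduces can be illustrated with a short example of ours (not from the paper):

```c
#include <omp.h>

#define N 1024

/* DOALL: no iteration of i depends on any other iteration of i,
 * so the outer loop can be distributed across threads directly. */
void doall_example(double a[N][N], double b[N][N]) {
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 2.0 * b[i][j];
}

/* DOACROSS: iteration i reads a value produced by iteration i - 1,
 * so parallel execution requires cross-iteration synchronization. */
void doacross_example(double a[N], const double b[N]) {
    for (int i = 1; i < N; i++)
        a[i] = a[i - 1] + b[i];
}
```

Only the DOALL form can be spread across threads without cross-iteration synchronization, which is why such loops are the natural targets for tiled, coarse-grained parallelization.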

Synthetic code generation

A large number of programs with a wide range of locality and vectorization features is necessary for the training dataset of the TSS model. Real applications or kernels could be used to gather training data for the neural network, but it is almost impossible to obtain from them a range of feature values broad enough to cover all possible cases. Moreover, some real kernels are used to validate the proposed TSS model. Therefore, we do not use real applications or kernels as the sources …
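The excerpt is cut off, but a minimal sketch of what one synthesized training kernel could look like is given below: a 3D loop nest over 2D arrays whose extents and reference mix would be varied randomly across generated programs so that the resulting locality and vectorization features span a wide range. The template and parameter names are our illustrative assumptions, not the paper's generator.

```c
/* Hypothetical template for one synthesized training kernel: a 3D loop
 * nest over 2D data. NI, NJ, NK and the reference pattern would be drawn
 * randomly per generated program to spread out the feature space. */
void synthetic_kernel(int NI, int NJ, int NK,
                      double **A, double **B, double **C) {
    for (int i = 1; i < NI - 1; i++)
        for (int k = 0; k < NK; k++)
            for (int j = 1; j < NJ - 1; j++)
                /* Mix of neighboring (stencil-like) references on A and a
                 * reduction over k on B, giving varied reuse distances. */
                C[i][j] += A[i - 1][j] + A[i + 1][j] + 0.5 * B[k][j];
}
```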

Experimental setup

The experiments were performed on 2 different platforms: an Intel Xeon server and an IBM Power6 server. The details of the 2 experimental platforms are listed in Table 3. The Intel platform has four 8-core processors and the IBM platform has one 4-core processor. All tiled codes were generated using PLuTo version 0.11.3. For fairness, we only measured the execution time of the tiled loop codes of each program instance in the evaluation, and each experimental result is …

Conclusion

This article restudied the key problem of TSS for profitable loop tiling and proposed an effective approach to predict optimal rectangular tile sizes using a machine learning technique. The proposed approach leverages the locality of data references in multiple loop dimensions and the interaction with vectorization in the innermost loop dimension to extract program features from the tiled codes. By setting the order of locality features for each dimension, the different impacts on locality of …

Acknowledgments

This work has been supported by the National Natural Science Foundation of China [grant numbers 91630206, 91330117] and the National Key Research and Development Program of China [grant numbers 2016YFB0201800, 2016YFB0200902].


References (52)

  • J. Ramanujam et al., Tiling multidimensional iteration spaces for multicomputers, J. Parallel Distrib. Comput. (1992)
  • D.F. Specht, The general regression neural network — rediscovered, Neural Netw. (1993)
  • N. Ahmed, N. Mateev, K. Pingali, Tiling Imperfectly-Nested Loop Nests, Univ. Cornell, Ithaca, NY, USA, Tech. Rep. …
  • N. Ahmed et al., Synthesizing transformations for locality enhancement of imperfectly-nested loop nests, Int. J. Parallel Program. (2001)
  • J. Bilmes, K. Asanovic, C.W. Chin, J. Demmel, Optimizing matrix multiply using PHiPAC: a portable, high-performance, …
  • U. Bondhugula, M.M. Baskaran, S. Krishnamoorthy, J. Ramanujam, A. Rountev, P. Sadayappan, Automatic transformations for …
  • J. Cavazos, J.E.B. Moss, Inducing heuristics to decide whether to schedule, in: Proc. ACM SIGPLAN Conf. PLDI, …
  • J. Chame, S. Moon, A tile selection algorithm for data locality and cache interference, in: Proc. Int. Conf. …
  • C. Chen, J. Chame, M. Hall, Combining models and guided empirical search to optimize for multiple levels of the memory …
  • S. Coleman, K. McKinley, Tile size selection using cache organization and data layout, in: Proc. ACM SIGPLAN Conf. …
  • R. Cytron, Doacross: beyond vectorization for multiprocessors, in: Proc. Int. Conf. Parallel Process., University Park, …
  • D. Feld, T. Soddemann, S. Mallach, Hardware-aware automatic code-transformation to support compilers in exploiting the …
  • J. Ferrante, V. Sarkar, W. Thrash, On estimating and enhancing cache effectiveness, in: Proc. Int. Workshop Languages …
  • B.B. Fraguela, M.G. Carmueja, D. Andrade, Optimal tile size selection guided by analytical models, in: Proc. Int. Conf. …
  • M. Frigo, A fast Fourier transform compiler, in: Proc. ACM SIGPLAN Conf. PLDI, Atlanta, GA, USA, 1999, pp. …
  • S. Ghosh, M. Martonosi, S. Malik, Cache miss equations: an analytical representation of cache misses, in: Proc. Int. …
  • K. Goto et al., High-performance implementation of the level-3 BLAS, ACM Trans. Math. Softw. (2008)
  • A. Hartono, M.M. Baskaran, C. Bastoul, A. Cohen, S. Krishnamoorthy, B. Norris, J. Ramanujam, PrimeTile: a parametric …
  • C. Hsu et al., A quantitative analysis of tile size selection algorithms, J. Supercomput. (2004)
  • E. Ipek et al., Efficient architectural design space exploration via predictive modeling, ACM Trans. Archit. Code Optim. (2008)
  • F. Irigoin, R. Triolet, Supernode partitioning, in: Proc. 15th ACM SIGPLAN-SIGACT Symp. POPL, San Diego, CA, USA, …
  • D.G. Kim, L. Renganarayanan, D. Rostron, S. Rajopadhye, Multilevel tiling: M for the price of one, in: Proc. ACM/IEEE …
  • M. Kong, R. Veras, K. Stock, F. Franchetti, P. Sadayappan, When polyhedral transformations meet SIMD code generation, …
  • M.D. Lam, E.E. Rothberg, M.E. Wolf, The cache performance and optimizations of blocked algorithms, in: Proc. Int. Conf. …
  • S. Larsen, S. Amarasinghe, Exploiting superword level parallelism with multimedia instruction sets, in: Proc. ACM …
  • X. Li, M.J. Garzarán, Optimizing matrix multiplication with a classifier learning system, in: Proc. Int. Conf. …

    Song Liu received the B.S. degree in computer science from the Northwestern Polytechnical University, China, in 2009, and the Ph.D. degree in computer science from Xi’an Jiaotong University, China, in 2018. He is currently with the School of Electronic Information and Engineering at Xi’an Jiaotong University. His research interests include code optimization, parallel computing and compiler optimization.

    Yuanzhen Cui received the B.S. degree in computer science from Xi’an Jiaotong University, China, in 2016, and he is currently working toward the M.S. degree in the School of Electronic Information and Engineering at Xi’an Jiaotong University. His research interests include code optimization.

    Qing Jiang received the B.S. and M.S. degrees in computer science from Xi’an Jiaotong University, China, in 2013 and 2016, respectively. He is currently with the Department of Trading System at China Financial Futures Exchange, Shanghai, China. His research interests include machine learning.

    Qian Wang received the B.S. degree in internet of things from the Nanjing University of Aeronautics and Astronautics, China, in 2017, and she is currently working toward the M.S. degree in the School of Electronic Information and Engineering at Xi’an Jiaotong University. Her research interests include parallel computing.

    Weiguo Wu received the B.S., M.S. and Ph.D. degrees in computer science from Xi’an Jiaotong University, China, in 1986, 1993 and 2006, respectively. He is currently a professor in the School of Electronic Information and Engineering at Xi’an Jiaotong University. He is a senior member of the CCF. His research interests include high performance computer architecture, storage systems, cloud computing, and embedded systems.
