Abstract
HPC developers strive to deliver the best possible performance. To do so, they constantly weigh memory bandwidth, the memory hierarchy, locality, floating-point throughput, power/energy constraints, and more. Application scientists, on the other hand, aim to write performance-portable code while still exploiting the rich feature set of the hardware. By providing adequate hints to the compiler in the form of directives, appropriate executable code can be generated, and directive-based programming offers tremendous benefits. However, applications are becoming increasingly complex, and sophisticated tools such as auto-tuning are needed to explore the optimization space effectively. Loops typically form the major, most time-consuming portion of an application's code. Scheduling these loops involves mapping the loop iteration space onto the underlying platform, for example onto GPU threads. The user tries different scheduling techniques until the best one is identified, but this process can be quite tedious and time consuming, especially for a relatively large application, since the user must record the performance of every schedule's run. This paper offers a better solution: an auto-tuning framework whose analytical model guides the compiler and the runtime to choose an appropriate schedule for each loop automatically and to determine the launch configuration for each loop schedule. Our experiments show that the loop schedule predicted by our framework achieves an average speedup of 1.29x over the default loop schedule chosen by the compiler.
Copyright information
© 2016 Springer International Publishing Switzerland
Cite this paper
Xu, R., Chandrasekaran, S., Tian, X., Chapman, B. (2016). An Analytical Model-Based Auto-tuning Framework for Locality-Aware Loop Scheduling. In: Kunkel, J., Balaji, P., Dongarra, J. (eds) High Performance Computing. ISC High Performance 2016. Lecture Notes in Computer Science(), vol 9697. Springer, Cham. https://doi.org/10.1007/978-3-319-41321-1_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-41320-4
Online ISBN: 978-3-319-41321-1