Abstract
Numerous code optimization techniques, including loop nest optimizations, have been developed over the last four decades. Loop optimization techniques transform loop nests to improve the performance of the code on a target architecture, for example by improving data locality or exposing parallelism. Finding and evaluating an optimal, semantics-preserving sequence of transformations is a complex problem: compilers guide the search with heuristics and/or analytical models, and there is no way of knowing how close the resulting code comes to optimal performance, or whether any headroom for improvement remains.
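To make the idea of a semantics-preserving loop transformation concrete, the sketch below shows loop interchange on a simple nest. This illustration is ours, not the paper's, and it uses Python for uniformity with the later sketch; the locality benefit it describes applies to compiled code operating on row-major arrays, where a unit-stride inner loop also gives the compiler a natural vectorization target.

    import numpy as np

    N = 256
    a = np.zeros((N, N))
    b = np.random.rand(N, N)

    # Original nest: the inner loop walks down a column, so consecutive
    # iterations touch elements that are N entries apart in memory.
    for j in range(N):
        for i in range(N):
            a[i, j] = 2.0 * b[i, j]

    # After loop interchange: same result (no dependence is violated),
    # but the inner loop is now unit-stride, which improves spatial
    # locality and vectorizability in a compiled language.
    for i in range(N):
        for j in range(N):
            a[i, j] = 2.0 * b[i, j]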
This paper makes two contributions. First, it presents a comparative analysis of loop optimizations/transformations across multiple compilers to determine how much headroom may exist for each compiler. Second, it presents an approach that characterizes loop nests by their hardware performance counter values and uses machine learning to predict which compiler will generate the fastest code for a given loop nest. The prediction is made both for auto-vectorized serial compilation and for auto-parallelization. Based on these machine-learning predictions, the headroom for state-of-the-art compilers ranges from 1.10x to 1.42x for serial code and from 1.30x to 1.71x for auto-parallelized code.
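The paper itself contains no code, but the prediction step can be sketched. The example below is a minimal sketch assuming scikit-learn, a hand-made toy data set, and a random-forest model (all our assumptions, not the authors' actual features, data, or classifier): it trains on per-loop-nest hardware counter values labeled with the compiler that produced the fastest binary, then predicts the best compiler for a new loop nest.

    # Minimal sketch: predict the fastest compiler for a loop nest from its
    # hardware performance counter values. The feature set, data, and model
    # are illustrative assumptions, not the paper's actual setup.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # One row per loop nest: counters from a baseline run, e.g.
    # [instructions retired, L1 misses, LLC misses, branch mispredictions].
    X = np.array([
        [1.2e9, 3.1e6, 4.0e5, 2.2e6],
        [8.0e8, 9.5e6, 1.1e6, 1.0e6],
        [2.5e9, 1.2e6, 2.0e5, 5.0e6],
        [6.1e8, 7.7e6, 9.8e5, 8.9e5],
        [1.9e9, 2.8e6, 3.5e5, 4.1e6],
        [7.3e8, 8.9e6, 1.3e6, 9.5e5],
    ])
    # Label: which compiler produced the fastest code for that loop nest.
    y = np.array(["gcc", "icc", "clang", "gcc", "icc", "clang"])

    model = RandomForestClassifier(n_estimators=100, random_state=0)
    # Cross-validation estimates prediction accuracy; cv=2 only because
    # the toy data set is tiny.
    print(cross_val_score(model, X, y, cv=2))

    model.fit(X, y)
    new_loop_nest = np.array([[1.0e9, 4.0e6, 5.0e5, 2.0e6]])
    print(model.predict(new_loop_nest))  # e.g. ['icc']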
Acknowledgments
This work was supported by NSF award XPS 1533926.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Shivam, A., Watkinson, N., Nicolau, A., Padua, D., Veidenbaum, A.V. (2019). Towards an Achievable Performance for the Loop Nests. In: Hall, M., Sundar, H. (eds.) Languages and Compilers for Parallel Computing. LCPC 2018. Lecture Notes in Computer Science, vol. 11882. Springer, Cham. https://doi.org/10.1007/978-3-030-34627-0_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-34626-3
Online ISBN: 978-3-030-34627-0