
Automated Compiler Optimization of Multiple Vector Loads/Stores

Published in: International Journal of Parallel Programming

Abstract

With widening vectors and the proliferation of advanced vector instructions in today's processors, vectorization plays an ever-increasing role in delivering application performance. Achieving the performance potential of this vector hardware has required significant support at the software level, such as new explicit vector programming models and advanced vectorizing compilers. Today, with the combination of these software tools and new SIMD ISA extensions like gather/scatter instructions, it is not uncommon to find that even codes with complex and irregular data access patterns can be vectorized. In this paper we focus on these vectorized codes with irregular accesses, and show that while the best-in-class Intel Compiler vectorizer does indeed provide speedup through efficient vectorization, there remain opportunities where clever program transformations can increase performance further. After identifying these opportunities, this paper describes two automatic compiler optimizations targeting these data access patterns. The first optimization improves performance for a group of adjacent gathers/scatters. The second improves performance for a group of stencil vector accesses by using more efficient SIMD instructions. Both optimizations are implemented in version 17.0 of the Intel Compiler. We evaluate the optimizations using an extensive set of micro-kernels, representative benchmarks, and application kernels. On these benchmarks, we demonstrate performance gains of 3–750% on the Intel® Xeon processor (Haswell, HSW), up to 25% on the Intel® Xeon Phi™ coprocessor (Knights Corner, KNC), and up to 430% on the Intel® Xeon Phi™ processor with AVX-512 instruction support (Knights Landing, KNL).



Author information


Correspondence to Vyacheslav P. Zakharin.


About this article


Cite this article

Aleen, F., Zakharin, V.P., Krishnaiyer, R. et al. Automated Compiler Optimization of Multiple Vector Loads/Stores. Int J Parallel Prog 46, 471–503 (2018). https://doi.org/10.1007/s10766-016-0485-7
