Abstract
With widening vectors and the proliferation of advanced vector instructions in today's processors, vectorization plays an ever-increasing role in delivering application performance. Achieving the performance potential of this vector hardware has required significant software support, such as new explicit vector programming models and advanced vectorizing compilers. Today, with the combination of these software tools and new SIMD ISA extensions such as gather/scatter instructions, even codes with complex and irregular data access patterns can often be vectorized. In this paper we focus on these vectorized codes with irregular accesses and show that, while the best-in-class Intel Compiler Vectorizer does provide speedup through efficient vectorization, there remain opportunities where program transformations can increase performance further. After identifying these opportunities, we describe two automatic compiler optimizations that target these data access patterns. The first improves the performance of a group of adjacent gathers/scatters. The second improves the performance of a group of stencil vector accesses by using more efficient SIMD instructions. Both optimizations are implemented in version 17.0 of the Intel Compiler. We evaluate them using an extensive set of micro-kernels, representative benchmarks, and application kernels. On these benchmarks, we demonstrate performance gains of 3–750% on the Intel® Xeon processor (Haswell, HSW), up to 25% on the Intel® Xeon Phi™ coprocessor (Knights Corner, KNC), and up to 430% on the Intel® Xeon Phi™ processor with AVX-512 instruction support (Knights Landing, KNL).
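To make the phrase "a group of adjacent gathers" concrete, the following scalar C sketch shows the kind of access pattern involved. It is an illustration only, not the transformation the Intel Compiler performs: the `Point` type, the field names, and both function names are hypothetical. In the first loop, each field read through `idx[i]` would typically become a separate gather when vectorized. The second loop reads the same data as one contiguous struct per index, which is the adjacency such an optimization could exploit.

```c
#include <stddef.h>

/* Hypothetical array-of-structures element; field names are illustrative. */
typedef struct { float x, y, z, w; } Point;

/* Baseline: four indexed reads per iteration. When vectorized, each field
   read would typically become its own gather over the same index vector. */
float sum_fields_gather(const Point *pts, const int *idx, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum += pts[idx[i]].x;
        sum += pts[idx[i]].y;
        sum += pts[idx[i]].z;
        sum += pts[idx[i]].w;
    }
    return sum;
}

/* Grouped form: the four fields are adjacent in memory, so a single
   contiguous struct copy per index fetches all of them at once. */
float sum_fields_grouped(const Point *pts, const int *idx, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        Point p = pts[idx[i]];  /* one contiguous read of x, y, z, w */
        sum += p.x;
        sum += p.y;
        sum += p.z;
        sum += p.w;
    }
    return sum;
}
```

Both functions add the same values in the same order, so they return identical results; only the shape of the memory accesses differs.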
Cite this article
Aleen, F., Zakharin, V.P., Krishnaiyer, R. et al. Automated Compiler Optimization of Multiple Vector Loads/Stores. Int J Parallel Prog 46, 471–503 (2018). https://doi.org/10.1007/s10766-016-0485-7
DOI: https://doi.org/10.1007/s10766-016-0485-7