
Automated Compiler Optimization of Multiple Vector Loads/Stores

Published in: International Journal of Parallel Programming

Abstract

With widening vectors and the proliferation of advanced vector instructions in today's processors, vectorization plays an ever-increasing role in delivering application performance. Achieving the performance potential of this vector hardware has required significant support at the software level, such as new explicit vector programming models and advanced vectorizing compilers. Today, with the combination of these software tools and new SIMD ISA extensions like gather/scatter instructions, it is not uncommon to find that even codes with complex and irregular data access patterns can be vectorized. In this paper we focus on these vectorized codes with irregular accesses, and show that while the best-in-class Intel Compiler vectorizer does indeed provide speedup through efficient vectorization, there remain opportunities where clever program transformations can increase performance further. After identifying these opportunities, this paper describes two automatic compiler optimizations targeting these data access patterns. The first optimization improves performance for a group of adjacent gathers/scatters. The second improves performance for a group of stencil vector accesses by using more efficient SIMD instructions. Both optimizations are implemented in version 17.0 of the Intel Compiler. We evaluate the optimizations using an extensive set of micro-kernels, representative benchmarks, and application kernels. On these benchmarks, we demonstrate performance gains of 3–750% on the Intel® Xeon processor (Haswell, HSW), up to 25% on the Intel® Xeon Phi™ coprocessor (Knights Corner, KNC), and up to 430% on the Intel® Xeon Phi™ processor with AVX-512 instruction support (Knights Landing, KNL).



Author information


Correspondence to Vyacheslav P. Zakharin.


About this article


Cite this article

Aleen, F., Zakharin, V.P., Krishnaiyer, R. et al. Automated Compiler Optimization of Multiple Vector Loads/Stores. Int J Parallel Prog 46, 471–503 (2018). https://doi.org/10.1007/s10766-016-0485-7
