Abstract
Vectorization of programs is crucial for achieving high performance on modern processors with SIMD (Single Instruction Multiple Data) extensions. Programs with IF-statements suffer from control flow divergence, which seriously complicates automatic vectorization. Contemporary compilers therefore employ IF-conversion, which converts control flow to data flow and relies on predicated execution (i.e., masked or select SIMD instructions). In this paper, we enhance the compiler's ability to generate efficiently vectorized code for processors without masked instructions. We improve the state of the art in program vectorization by developing a novel approach, the IF-select transformation, which is applicable to arbitrarily nested IF-statements. We implement our approach in the open-source Open64 compiler and evaluate its performance on the SW26010 processor, which does not support masked instructions and is used in the Sunway TaihuLight supercomputer (currently #3 in the TOP500 list). We extend our vectorization approach with an additional LLVM optimization pass that reduces the number of masked memory accesses on processors without masked instructions, e.g., IBM Power8 and ARM Cortex-A8. Experimental results demonstrate the performance advantages of the suggested vectorization techniques.
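To make the idea of converting control flow to data flow concrete, the following is a minimal sketch of generic IF-conversion using a select, written in scalar C. It is not the IF-select transformation proposed in the paper, whose details are not given in the abstract; the function names `scale_branch` and `scale_select` and the loop body are hypothetical and chosen only for illustration.

```c
#include <stddef.h>

/* Original loop: the branch depends on the data, so control flow can
 * diverge from one iteration to the next (hypothetical example kernel). */
void scale_branch(const float *a, float *b, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        if (a[i] > 0.0f)
            b[i] = a[i] * 2.0f;
        else
            b[i] = 0.0f;
    }
}

/* IF-converted loop: both arms are evaluated unconditionally and a
 * select chooses the result per element. The loop body is now
 * straight-line code, which a vectorizer can map to a SIMD compare
 * followed by a SIMD select, without needing masked instructions. */
void scale_select(const float *a, float *b, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        float then_val = a[i] * 2.0f;        /* value of the THEN arm */
        float else_val = 0.0f;               /* value of the ELSE arm */
        int cond = a[i] > 0.0f;              /* per-element condition */
        b[i] = cond ? then_val : else_val;   /* select, no branch needed */
    }
}
```

Evaluating both arms is safe here because neither arm has side effects or can fault, and both arms store to `b[i]`. When only one arm stores, the select has to blend the new value with the old contents of memory, which is the situation in which masked memory accesses become relevant.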
Acknowledgements
This research is supported by a China Scholarship Council (CSC) scholarship and by the German Federal Ministry of Education and Research (BMBF) in the project HPC2SE. We thank the National Supercomputing Center in Wuxi, China, for providing access to the Sunway TaihuLight supercomputer.
About this article
Cite this article
Sun, H., Gorlatch, S. & Zhao, R. Vectorizing programs with IF-statements for processors with SIMD extensions. J Supercomput 76, 4731–4746 (2020). https://doi.org/10.1007/s11227-019-03057-4
DOI: https://doi.org/10.1007/s11227-019-03057-4