DOI: 10.1145/3330345.3331059
research-article

WCCV: improving the vectorization of IF-statements with warp-coherent conditions

Published: 26 June 2019

Abstract

When vectorizing programs for modern processors with SIMD extensions, IF-statements pose a challenge: existing vectorization approaches often introduce redundant computations or resort to inefficient masked instructions.
In this paper, we introduce the notion of warp-coherence for conditions that exhibit coherent run-time behavior across the lanes of a vector register. We demonstrate that warp-coherent conditions appear frequently in practice. We present Warp-Coherent Condition Vectorization (WCCV), an approach to detecting and optimizing IF-statements with warp-coherent conditions, which vectorizes programs with IF-statements efficiently while avoiding the overhead of existing methods. WCCV detects warp-coherent conditions via affine analysis of conditional boolean expressions and branch predication of IF-statements; the run-time code generated by WCCV avoids redundant computations and masked instructions. We employ auto-tuning to find the best benefit-overhead ratio for WCCV. We implement WCCV on top of the Region Vectorizer (RV), an LLVM-based vectorizing compiler, and conduct experiments on the Rodinia benchmark suite, achieving a mean speedup of 1.14× over the original vectorized and optimized code, and speedups between 0.98× and 7.02× over the scalar code on Skylake with AVX512.
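
The contrast the abstract draws can be illustrated with a small simulation. This is a minimal sketch and not the code WCCV generates. It assumes a warp width of 16 lanes, reads "coherent" as "all lanes of a warp agree on the condition", and uses hypothetical branch bodies `then_op` and `else_op`. The `work` counter is a crude proxy for cost: one unit per vector-wide branch evaluation.

```python
WARP = 16  # assumed lane count, e.g. 16 x 32-bit values in one AVX512 register


def then_op(x):
    # Hypothetical body of the "then" branch.
    return x * 2.0


def else_op(x):
    # Hypothetical body of the "else" branch.
    return x + 1.0


def run_masked(xs, cond):
    """If-converted baseline: evaluate both branches for every warp, then blend."""
    out, work = [], 0
    for base in range(0, len(xs), WARP):
        lanes = xs[base:base + WARP]
        mask = [cond(base + k) for k in range(len(lanes))]
        t = [then_op(x) for x in lanes]
        e = [else_op(x) for x in lanes]
        work += 2  # both branches evaluated, one of them redundantly per lane
        out.extend(tv if m else ev for m, tv, ev in zip(mask, t, e))
    return out, work


def run_coherent(xs, cond):
    """Run-time check: if all lanes agree, run only the branch that is taken."""
    out, work = [], 0
    for base in range(0, len(xs), WARP):
        lanes = xs[base:base + WARP]
        mask = [cond(base + k) for k in range(len(lanes))]
        if all(mask):          # coherent, condition true on every lane
            out.extend(then_op(x) for x in lanes)
            work += 1
        elif not any(mask):    # coherent, condition false on every lane
            out.extend(else_op(x) for x in lanes)
            work += 1
        else:                  # divergent warp: fall back to masking
            t = [then_op(x) for x in lanes]
            e = [else_op(x) for x in lanes]
            work += 2
            out.extend(tv if m else ev for m, tv, ev in zip(mask, t, e))
    return out, work


# A condition that is affine in the loop index: true for i < 40.
xs = [float(i) for i in range(64)]
cond = lambda i: i < 40

masked_out, masked_work = run_masked(xs, cond)
coherent_out, coherent_work = run_coherent(xs, cond)
print(masked_out == coherent_out, masked_work, coherent_work)  # True 8 5
```

With 64 elements the loop runs over four warps. Only the warp covering indices 32 to 47 straddles the threshold, so the coherent variant evaluates both branches once and a single branch three times. This is one way to see why conditions that are affine in the loop index are good candidates for this treatment: at most one warp per threshold crossing diverges.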

Published In

ICS '19: Proceedings of the ACM International Conference on Supercomputing
June 2019
533 pages
ISBN:9781450360791
DOI:10.1145/3330345

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. IF-statements
  2. SPMD-on-SIMD
  3. compiler optimization
  4. vectorization
  5. warp-coherence

Qualifiers

  • Research-article

Conference

ICS '19
Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Cited By

  • (2025) pyATF: Constraint-Based Auto-Tuning in Python. Proceedings of the 34th ACM SIGPLAN International Conference on Compiler Construction, 35-47. DOI: 10.1145/3708493.3712682. Online publication date: 25-Feb-2025
  • (2022) Combining Run-Time Checks and Compile-Time Analysis to Improve Control Flow Auto-Vectorization. Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 439-450. DOI: 10.1145/3559009.3569663. Online publication date: 8-Oct-2022
  • (2022) Vectorizing divergent control flow with active-lane consolidation on long-vector architectures. The Journal of Supercomputing 78:10, 12553-12588. DOI: 10.1007/s11227-022-04359-w. Online publication date: 7-Mar-2022
  • (2021) Efficient Auto-Tuning of Parallel Programs with Interdependent Tuning Parameters via Auto-Tuning Framework (ATF). ACM Transactions on Architecture and Code Optimization 18:1, 1-26. DOI: 10.1145/3427093. Online publication date: 20-Jan-2021
  • (2019) Vectorizing programs with IF-statements for processors with SIMD extensions. The Journal of Supercomputing. DOI: 10.1007/s11227-019-03057-4. Online publication date: 11-Nov-2019
