ABSTRACT
If-conversion is a fundamental technique for vectorization. It accounts for the fact that in a SIMD program, several targets of a branch might be executed because of divergence. Especially for irregular data-parallel workloads, it is crucial to avoid if-converting non-divergent branches in order to increase SIMD utilization. In this paper, we present partial linearization, a simple and efficient if-conversion algorithm that overcomes several limitations of existing if-conversion techniques. In contrast to prior work, it has provable guarantees on which non-divergent branches are retained, and it never duplicates code or inserts additional branches. We show how our algorithm can be used in a classic loop vectorizer as well as to implement data-parallel languages such as ISPC or OpenCL. Furthermore, we implement prior vectorizer optimizations on top of partial linearization in a more general way. We evaluate the implementation of our algorithm in LLVM on a range of irregular data analytics kernels, a neutronics simulation benchmark, and NAB, a molecular dynamics benchmark from SPEC2017, on AVX2, AVX512, and ARM Advanced SIMD machines, and report speedups of up to 146% over ICC, GCC, and Clang at O3.
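The trade-off the abstract describes can be seen in a small example. Below is a minimal sketch (the function names are illustrative, not from the paper) of what if-conversion does to a data-dependent branch: both sides are computed on every lane and the result is chosen by a predicate, so a vectorizer can execute all lanes in lockstep without control flow. The point of partial linearization is that a branch like this only needs to be converted when it is divergent; a uniform branch can be kept as-is.

```c
/* Illustrative scalar kernel with a data-dependent (potentially
 * divergent) branch: different loop iterations, i.e. SIMD lanes,
 * may take different sides. */
void kernel_branchy(const int *a, int *out, int n) {
    for (int i = 0; i < n; i++) {
        if (a[i] > 0)
            out[i] = a[i] * 2;
        else
            out[i] = -a[i];
    }
}

/* The same kernel after if-conversion: both sides are evaluated
 * unconditionally and a per-lane predicate selects the result.
 * The ternary maps to a SIMD blend/select, so the loop body is
 * straight-line code that runs on all lanes at once. */
void kernel_ifconverted(const int *a, int *out, int n) {
    for (int i = 0; i < n; i++) {
        int p      = a[i] > 0;   /* lane predicate */
        int then_v = a[i] * 2;   /* "then" side, always computed */
        int else_v = -a[i];      /* "else" side, always computed */
        out[i] = p ? then_v : else_v;
    }
}
```

The converted form trades control flow for extra work on the untaken side, which is why converting branches that are in fact non-divergent wastes SIMD utilization.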
Supplemental Material
Appendix of the paper "Partial Control-Flow Linearization", PLDI '18.