DOI: 10.1145/3192366.3192413
research-article

Partial control-flow linearization

Published: 11 June 2018

ABSTRACT

If-conversion is a fundamental technique for vectorization. It accounts for the fact that, in a SIMD program, several targets of a branch might be executed because of divergence. Especially for irregular data-parallel workloads, it is crucial to avoid if-converting non-divergent branches in order to increase SIMD utilization. In this paper, we present partial linearization, a simple and efficient if-conversion algorithm that overcomes several limitations of existing if-conversion techniques. In contrast to prior work, it has provable guarantees on which non-divergent branches are retained, and it never duplicates code or inserts additional branches. We show how our algorithm can be used in a classic loop vectorizer as well as to implement data-parallel languages such as ISPC or OpenCL. Furthermore, we implement prior vectorizer optimizations on top of partial linearization in a more general way. We evaluate the implementation of our algorithm in LLVM on a range of irregular data analytics kernels, a neutronics simulation benchmark, and NAB, a molecular dynamics benchmark from SPEC2017. On AVX2, AVX512, and ARM Advanced SIMD machines, we report speedups of up to 146% over ICC, GCC, and Clang O3.
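The distinction between divergent and non-divergent branches can be illustrated with a small sketch. It simulates a 4-lane SIMD unit with plain Python lists. The kernel, the helper names (`select`, `scalar_kernel`, `simd_kernel`), the `keep_uniform_branch` switch and the `trace` list that records which branch bodies execute are all hypothetical. The sketch shows classic if-conversion of a divergent branch (both sides run, then a per-lane mask blends the results) next to a uniform branch that is kept as an ordinary branch. It does not implement the paper's partial linearization algorithm, which decides on the whole control-flow graph which non-divergent branches can be retained.

```python
from typing import List

Vec = List[float]
Mask = List[bool]


def select(mask: Mask, a: Vec, b: Vec) -> Vec:
    """Per-lane blend: lane i takes a[i] if mask[i] is set, else b[i]."""
    return [x if m else y for m, x, y in zip(mask, a, b)]


def scalar_kernel(x: float, clamp: bool) -> float:
    """Scalar reference: what each SIMD lane should compute."""
    y = 2.0 * x if x > 0 else -x  # condition depends on the lane's x: divergent
    if clamp:                     # same value for every lane: uniform
        y = min(y, 10.0)
    return y


def simd_kernel(xs: Vec, clamp: bool, trace: List[str],
                keep_uniform_branch: bool = True) -> Vec:
    """Hypothetical vectorized kernel; `trace` records which bodies execute."""
    # Divergent branch: lanes may disagree on x > 0, so it is if-converted.
    # Both sides run on every lane and the mask picks each lane's result.
    mask = [x > 0 for x in xs]
    then_vals = [2.0 * x for x in xs]
    trace.append("then")
    else_vals = [-x for x in xs]
    trace.append("else")
    ys = select(mask, then_vals, else_vals)

    if keep_uniform_branch:
        # Non-divergent branch: `clamp` is one scalar shared by all lanes,
        # so it can stay an ordinary branch and its body may be skipped.
        if clamp:
            ys = [min(y, 10.0) for y in ys]
            trace.append("clamp")
    else:
        # Naive full linearization: the uniform branch is if-converted too,
        # so its body always runs and is blended under a splatted mask.
        clamped = [min(y, 10.0) for y in ys]
        trace.append("clamp")
        ys = select([clamp] * len(ys), clamped, ys)
    return ys
```

For `xs = [3.0, -1.0, 7.0, -4.0]` and `clamp=False`, both variants return `[6.0, 1.0, 14.0, 4.0]`, matching `scalar_kernel` lane by lane. Only the variant that retains the uniform branch skips the clamp body, leaving `trace` as `['then', 'else']`. The fully linearized variant records `['then', 'else', 'clamp']`. The wasted work grows with the size of the body of each needlessly if-converted uniform branch, which is why retaining non-divergent branches matters.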


Supplemental Material: p543-moll.webm (WebM, 115.3 MB)


Published in
PLDI 2018: Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation
June 2018, 825 pages
ISBN: 9781450356985
DOI: 10.1145/3192366

Copyright © 2018 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher: Association for Computing Machinery, New York, NY, United States


Overall Acceptance Rate: 406 of 2,067 submissions, 20%
