ABSTRACT
Optimizing compilers employ a rich set of transformations that generate highly efficient code for a variety of source languages and target architectures. These transformations typically operate on general control-flow constructs, which open up a range of optimization opportunities such as moving code to less frequently executed paths. Regular loop nests are particularly relevant for accelerating certain domains: they can leverage architectural features including vector instructions, hardware-controlled loops and data flows, provided their internal control flow is eliminated. Compilers typically apply predicating if-conversion late, in their backend, to remove control flow that is undesirable for the target. Until then, transformations triggered by control-flow constructs that are destined to be removed may end up doing more harm than good. We present an approach that leverages the existing powerful and general optimization flow of LLVM when compiling for targets without control flow in loops. Rather than trying to teach various transformations how to avoid misoptimizing for such targets, we propose introducing an aggressive if-conversion pass as early as possible, along with carefully addressing pass-ordering implications. This solution outperforms the traditional compilation flow with only a modest tuning effort, thereby offering a robust and promising compilation approach for branch-restricted targets.
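As a rough illustration of what if-conversion does to a loop body, the following C sketch contrasts a loop containing an if/else with a hand-written branch-free equivalent. It is only a source-level analogy under stated assumptions: the function names and the helper `check_same` are hypothetical, the pass described in the abstract works on LLVM IR rather than on C source, and a C compiler is not guaranteed to lower the conditional expression to a select or predicated instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Branchy form: the loop body contains an if/else, i.e. internal control flow. */
void sub_clamp_branchy(const uint32_t *in, uint32_t *out, size_t n, uint32_t t) {
    for (size_t i = 0; i < n; ++i) {
        if (in[i] > t)
            out[i] = in[i] - t;
        else
            out[i] = 0u;
    }
}

/* If-converted form (hand-written sketch): the condition becomes a predicate
 * value, both arms are computed unconditionally, and the result is selected. */
void sub_clamp_ifconverted(const uint32_t *in, uint32_t *out, size_t n, uint32_t t) {
    for (size_t i = 0; i < n; ++i) {
        int p = in[i] > t;          /* predicate */
        uint32_t diff = in[i] - t;  /* speculated; unsigned wraparound is well defined */
        out[i] = p ? diff : 0u;     /* select between the two arm values */
    }
}

/* Hypothetical helper: asserts that both forms agree on up to 16 elements. */
void check_same(const uint32_t *in, size_t n, uint32_t t) {
    uint32_t a[16], b[16];
    assert(n <= 16);
    sub_clamp_branchy(in, a, n, t);
    sub_clamp_ifconverted(in, b, n, t);
    for (size_t i = 0; i < n; ++i)
        assert(a[i] == b[i]);
}
```

The example uses unsigned arithmetic on purpose: the subtraction in the second form runs even when the original branch would have skipped it, so the speculated operation must be free of side effects and undefined behavior for the two forms to be equivalent.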
Index Terms
- If-Convert as Early as You Must