research-article

Pointer-Based Divergence Analysis for OpenCL 2.0 Programs

Published: 15 October 2021

Abstract

A modern GPU is designed with many large thread groups to achieve high throughput and performance. Within these groups, threads are organized into fixed-size SIMD batches in which the same instruction is applied to vectors of data in lockstep. This architecture suits applications with a high degree of data parallelism, but its performance degrades severely when divergence occurs. Many optimizations for divergence have been proposed, and they depend on divergence information about variables and branches. A previous analysis scheme treated pointers and function return values as divergent outright and targeted only OpenCL 1.x. In this article, we present a novel scheme that reports divergence information for pointer-intensive OpenCL programs. The approach is based on extended static single assignment (SSA) form and adds special functions and annotations from memory SSA and gated SSA. The proposed scheme first constructs the extended SSA form, which is then used to build a divergence relation graph that includes all possible points-to relationships of the pointers together with the initial divergence states. The divergence state of each pointer is then determined by propagating divergence states through the divergence relation graph. The scheme is further extended to interprocedural cases by considering function-related statements. The proposed scheme was implemented in an LLVM compiler and can be applied to OpenCL programs. We analyzed 10 programs containing 24 kernels, totaling 1,306 instructions in LLVM intermediate representation, with 885 variables, 108 branches, and 313 pointer-related statements. The proposed scheme detected 146 divergent pointers, compared with 200 for a scheme in which every pointer is always divergent and 155 for the current LLVM default scheme; the corresponding numbers of divergent variables were 458, 519, and 482, and the numbers of divergent branches were 31, 34, and 32.
These experimental results indicate that the proposed scheme is more precise than both a scheme in which a pointer is always divergent and the current LLVM default scheme.
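
The propagation step described in the abstract can be illustrated with a small worklist sketch. This is not the article's algorithm: it models the divergence relation graph as plain dependence edges and treats propagation as reachability from the initially divergent nodes, leaving out extended SSA, points-to sets, and gating functions. The function name `propagate_divergence`, the node names, and the toy kernel in the comments are all hypothetical.

```python
from collections import defaultdict, deque


def propagate_divergence(edges, seeds):
    """Return every node reachable from an initially divergent node.

    edges: iterable of (source, target) pairs, read as
           "the value of target may depend on source".
    seeds: nodes assumed divergent from the start (e.g. a work-item ID).
    """
    successors = defaultdict(list)
    for source, target in edges:
        successors[source].append(target)

    divergent = set(seeds)
    worklist = deque(divergent)
    while worklist:
        node = worklist.popleft()
        for target in successors[node]:
            if target not in divergent:
                divergent.add(target)
                worklist.append(target)
    return divergent


# Hypothetical toy kernel, written as dependence edges (an assumption
# made for illustration, not taken from the article):
#   tid = get_global_id(0);   seed: differs per work-item
#   p   = &a[tid];            p depends on tid
#   q   = &b[0];              q depends on nothing divergent
#   x   = *p;                 x depends on p
#   y   = *q;                 y depends on q
edges = [("tid", "p"), ("p", "x"), ("q", "y")]
result = propagate_divergence(edges, seeds={"tid"})
print(sorted(result))  # ['p', 'tid', 'x']
```

In this toy graph `q` and `y` stay uniform because nothing divergent reaches them, whereas a scheme that marks every pointer as divergent would also report `q` and, through it, `y`.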


Published In

ACM Transactions on Parallel Computing, Volume 8, Issue 4
December 2021
118 pages
ISSN: 2329-4949
EISSN: 2329-4957
DOI: 10.1145/3481693

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 15 October 2021
Accepted: 01 April 2021
Revised: 01 April 2021
Received: 01 March 2020
Published in TOPC Volume 8, Issue 4

Author Tags

  1. Compiler
  2. pointer analysis
  3. divergence analysis
  4. graphics processing units

Qualifiers

  • Research-article
  • Refereed

Funding Sources

  • MediaTek
  • MOST of Taiwan

Article Metrics

  • Downloads (last 12 months): 56
  • Downloads (last 6 weeks): 8

Reflects downloads up to 07 Mar 2025.

Cited By

  • (2025) Support of MISRA C++ Analyzer for Reliability of Embedded Systems. ACM Transactions on Cyber-Physical Systems 9:1 (1-27). DOI: 10.1145/3611390. Online publication date: 22-Jan-2025
  • (2024) DTAP: Accelerating Strongly-Typed Programs with Data Type-Aware Hardware Prefetching. ACM Transactions on Architecture and Code Optimization. DOI: 10.1145/3701994. Online publication date: 29-Oct-2024
  • (2024) The Rewriting of DataRaceBench Benchmark for OpenCL Program Validations. Workshop Proceedings of the 53rd International Conference on Parallel Processing (15-22). DOI: 10.1145/3677333.3678148. Online publication date: 12-Aug-2024
  • (2024) Case Study: Optimization Methods With TVM Hybrid-OP on RISC-V Packed SIMD. IEEE Access 12 (64193-64211). DOI: 10.1109/ACCESS.2024.3397195. Online publication date: 2024
  • (2023) Pointer Analysis for Programs on Hybrid DRAM-PM Memory Systems. Proceedings of the 52nd International Conference on Parallel Processing Workshops (96-103). DOI: 10.1145/3605731.3605906. Online publication date: 7-Aug-2023
  • (2022) The Support of MLIR HLS Adaptor for LLVM IR. Workshop Proceedings of the 51st International Conference on Parallel Processing (1-8). DOI: 10.1145/3547276.3548515. Online publication date: 29-Aug-2022
  • (2022) Register-Pressure Aware Predicator for Length Multiplier of RVV. Workshop Proceedings of the 51st International Conference on Parallel Processing (1-9). DOI: 10.1145/3547276.3548513. Online publication date: 29-Aug-2022
  • (2022) Preserving Addressability Upon GC-Triggered Data Movements on Non-Volatile Memory. ACM Transactions on Architecture and Code Optimization 19:2 (1-26). DOI: 10.1145/3511706. Online publication date: 24-Mar-2022
