DOI: 10.1145/2723742.2723760
research-article

A Profile Guided Approach to Optimize Branch Divergence While Transforming Applications for GPUs

Published: 18 February 2015

Abstract

GPUs offer a powerful bulk-synchronous programming model for exploiting data parallelism; however, branch divergence within executing warps can seriously degrade performance because the divergent paths are executed serially. We propose a novel profile-guided approach to optimize branch divergence while transforming a serial program into a data-parallel program for GPUs. Our approach is based on the observation that branches inside some data-parallel loops, although divergent, exhibit repetitive, regular patterns of outcomes. By exploiting such patterns, loop iterations can be aligned so that corresponding iterations traverse the same branch path. These aligned iterations become convergent when executed as a warp on a GPU. We propose a new metric, based on the characteristics of the repetitive pattern, that indicates whether a data-parallel loop is worth restructuring. When we tested our approach on the well-known Rodinia benchmark suite, we found that the loop restructuring suggested by the patterns and our metric can achieve up to 48% performance improvement.
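The alignment idea can be illustrated with a small Python sketch. It is not the paper's algorithm, which the abstract does not spell out: the helper names (`smallest_period`, `align_by_pattern`, `divergent_warps`), the brute-force period search, and the synthetic alternating profile are all assumptions made for illustration. The sketch takes a profiled sequence of branch outcomes, one per loop iteration, finds its shortest repeating period, and regroups iterations by their offset within that period, so that iterations packed into the same 32-thread warp tend to take the same branch path.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def smallest_period(outcomes):
    """Return the smallest p with outcomes[i] == outcomes[i - p] for all i >= p.

    Brute-force search, used here only for illustration.
    """
    n = len(outcomes)
    for p in range(1, n + 1):
        if all(outcomes[i] == outcomes[i - p] for i in range(p, n)):
            return p
    return n


def align_by_pattern(outcomes):
    """Reorder iteration indices so that iterations at the same offset
    within the repeating pattern become adjacent."""
    p = smallest_period(outcomes)
    n = len(outcomes)
    return [i for offset in range(p) for i in range(offset, n, p)]


def divergent_warps(outcomes, order, warp_size=WARP_SIZE):
    """Count warps whose iterations do not all take the same branch path."""
    count = 0
    for start in range(0, len(order), warp_size):
        warp = order[start:start + warp_size]
        if len({outcomes[i] for i in warp}) > 1:
            count += 1
    return count


# Hypothetical profile: the branch is taken on every even iteration.
profile = [i % 2 == 0 for i in range(128)]
original = list(range(128))
aligned = align_by_pattern(profile)

before = divergent_warps(profile, original)  # every warp mixes both paths
after = divergent_warps(profile, aligned)    # each warp follows a single path
print(f"divergent warps: {before} before alignment, {after} after")
```

In this synthetic case each offset group holds a whole number of warps, so divergence disappears entirely. With real profiles, groups may straddle warp boundaries and the remapping itself has a cost, which is presumably why a metric is needed to decide whether restructuring a given loop pays off.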



Published In

ISEC '15: Proceedings of the 8th India Software Engineering Conference
February 2015
207 pages
ISBN:9781450334327
DOI:10.1145/2723742

In-Cooperation

  • iSOFT
  • ACM India

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ISEC '15
ISEC '15: 8th India Software Engineering Conference
February 18 - 20, 2015
Bangalore, India

Acceptance Rates

Overall Acceptance Rate 76 of 315 submissions, 24%


Citations

Cited By

  • (2021) Contrived and Remediated GPU Thread Divergence Using a Flattening Technique. Advances in Parallel & Distributed Processing, and Applications, pp. 647-658. DOI: 10.1007/978-3-030-69984-0_46. Online publication date: 19-Oct-2021
  • (2017) Merge or Separate? Proceedings of the General Purpose GPUs, pp. 22-31. DOI: 10.1145/3038228.3038235. Online publication date: 4-Feb-2017
  • (2016) Eliminating Intra-Warp Load Imbalance in Irregular Nested Patterns via Collaborative Task Engagement. 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 524-533. DOI: 10.1109/IPDPS.2016.36. Online publication date: May-2016
  • (2016) An Automated Analysis of the Branch Coverage and Energy Consumption Using Concolic Testing. Arabian Journal for Science and Engineering, 42:2, pp. 619-637. DOI: 10.1007/s13369-016-2284-2. Online publication date: 27-Aug-2016
  • (2015) Efficient warp execution in presence of divergence with collaborative context collection. Proceedings of the 48th International Symposium on Microarchitecture, pp. 204-215. DOI: 10.1145/2830772.2830796. Online publication date: 5-Dec-2015
