research-article

A Profile Guided Approach to Optimize Branch Divergence While Transforming Applications for GPUs

Authors:

Santonu Sarkar,

Sayantan Mitra

ISEC '15: Proceedings of the 8th India Software Engineering Conference

Pages 176 - 185

https://doi.org/10.1145/2723742.2723760

Published: 18 February 2015

Abstract

GPUs offer a powerful bulk synchronous programming model for exploiting data parallelism; however, branch divergence among the threads of an executing warp can cause serious performance degradation because the divergent paths are executed serially. We propose a novel profile-guided approach to optimize branch divergence while transforming a serial program into a data-parallel program for GPUs. Our approach is based on the observation that branches inside some data-parallel loops, although divergent, exhibit repetitive, regular patterns of outcomes. By exploiting such patterns, loop iterations can be aligned so that corresponding iterations traverse the same branch path. These aligned iterations, when executed as a warp on a GPU, become convergent. We propose a new metric, based on the characteristics of the repetitive patterns, that indicates whether a data-parallel loop is worth restructuring. When we tested our approach on the well-known Rodinia benchmark suite, we found that the loop restructuring suggested by the patterns and our metric can improve performance by up to 48%.
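
The idea of aligning iterations by branch outcome can be illustrated with a small simulation. The sketch below is not the authors' algorithm or metric; it only assumes a warp of 32 threads, a hypothetical profile in which the branch is taken on every fourth iteration, and a simple stable regrouping of iterations by outcome. The names `divergent_warps` and `align_iterations` are invented for this illustration.

```python
WARP_SIZE = 32  # assumed warp width (threads per warp on NVIDIA GPUs)


def divergent_warps(outcomes, warp_size=WARP_SIZE):
    """Count warps in which the threads do not all take the same branch path."""
    count = 0
    for start in range(0, len(outcomes), warp_size):
        warp = outcomes[start:start + warp_size]
        if len(set(warp)) > 1:  # mixed outcomes: the two paths are serialized
            count += 1
    return count


def align_iterations(outcomes):
    """Return iteration indices reordered so that equal outcomes are adjacent.

    sorted() is stable, so iterations with the same outcome keep their
    original relative order.
    """
    return sorted(range(len(outcomes)), key=lambda i: outcomes[i])


# Hypothetical profile: the branch is taken on every 4th iteration (T N N N ...).
profile = [i % 4 == 0 for i in range(256)]

order = align_iterations(profile)
aligned = [profile[i] for i in order]

before = divergent_warps(profile)  # every warp mixes taken and not-taken
after = divergent_warps(aligned)   # each warp now follows a single path

print(f"divergent warps before: {before}, after: {after}")
```

With this regular pattern, every warp is divergent under the natural iteration-to-thread mapping, and none is after regrouping. A real transformation would also have to account for the cost of remapping data accesses, which this sketch ignores.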

Published In

ISEC '15: Proceedings of the 8th India Software Engineering Conference

February 2015

207 pages

ISBN: 9781450334327

DOI: 10.1145/2723742

General Chairs: Srinivas Padmanabhuni (Infosys Labs), Raghu Nambiar (Siemens)

Program Chairs: Prem Devanbu (University of California, Davis), Murali Krishna Ramanathan (IISc, Bangalore)

Publications Chair: Ashish Sureka (IIIT Delhi)

Copyright © 2015 ACM.


In-Cooperation

iSOFT
ACM India

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

Science and Engineering Research Board

Conference

ISEC '15: 8th India Software Engineering Conference

February 18 - 20, 2015

Bangalore, India

Acceptance Rates

Overall Acceptance Rate 76 of 315 submissions, 24%

Bibliometrics

Total Citations: 5
Total Downloads: 154
Downloads (last 12 months): 16
Downloads (last 6 weeks): 6

Reflects downloads up to 16 Feb 2025

Cited By

L. Vespa and G. Peters (2021). Contrived and Remediated GPU Thread Divergence Using a Flattening Technique. Advances in Parallel & Distributed Processing, and Applications, pages 647-658. Online publication date: 19-Oct-2021.
https://doi.org/10.1007/978-3-030-69984-0_46

Y. Wen and M. O'Boyle (2017). Merge or Separate? Proceedings of the General Purpose GPUs, pages 22-31. Online publication date: 4-Feb-2017.
https://dl.acm.org/doi/10.1145/3038228.3038235

F. Khorasani, B. Rowe, R. Gupta, and L. Bhuyan (2016). Eliminating Intra-Warp Load Imbalance in Irregular Nested Patterns via Collaborative Task Engagement. 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 524-533. Online publication date: May-2016.
https://doi.org/10.1109/IPDPS.2016.36

S. Godboley, S. Panda, A. Dutta, and D. Mohapatra (2016). An Automated Analysis of the Branch Coverage and Energy Consumption Using Concolic Testing. Arabian Journal for Science and Engineering, 42:2, pages 619-637. Online publication date: 27-Aug-2016.
https://doi.org/10.1007/s13369-016-2284-2

F. Khorasani, R. Gupta, L. Bhuyan, and M. Prvulovic (2015). Efficient warp execution in presence of divergence with collaborative context collection. Proceedings of the 48th International Symposium on Microarchitecture, pages 204-215. Online publication date: 5-Dec-2015.
https://dl.acm.org/doi/10.1145/2830772.2830796
