ABSTRACT
Recent studies have shown promising performance benefits when multiple stages of a pipelined stencil application are mapped to different parts of a GPU to run concurrently. An important factor in the computing efficiency of such pipelines is task granularity. In previous programming frameworks that support true pipelined computation on GPUs, programmers had to choose the granularity at application development time. Because this choice is difficult to make well, programmers' decisions are often far from optimal, hurting both performance and performance portability.
This paper presents GOPipe, a granularity-oblivious programming framework for efficient pipelined stencil execution on GPUs. With GOPipe, programmers no longer need to specify task granularity. GOPipe finds an appropriate granularity automatically and dynamically schedules tasks of that granularity for efficiency while observing all inter-task and inter-stage data dependencies. In our experiments on six real-life applications and various scenarios, GOPipe outperforms the state-of-the-art system by 1.39X on average while offering much better programming productivity.
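To make the scheduling idea concrete, the following is a minimal, hypothetical sketch (not GOPipe's actual API or implementation) of a pipelined stencil in which each stage processes tiles of a configurable granularity, and a tile of stage s may run only once the halo-neighbor tiles of stage s-1 are done — the inter-stage dependency the abstract refers to. All names (`blur3`, `run_pipeline`, `tile_size`) are illustrative assumptions.

```python
import numpy as np

def blur3(row):
    # 1D 3-point box stencil with clamped (edge-replicated) borders
    padded = np.pad(row, 1, mode="edge")
    return (padded[:-2] + padded[1:-1] + padded[2:]) / 3.0

def run_pipeline(data, num_stages, tile_size):
    # Split the domain into tiles; tile_size is the task granularity
    # that a framework like GOPipe would pick automatically.
    n = len(data)
    tiles = [(i, min(i + tile_size, n)) for i in range(0, n, tile_size)]
    done = [set() for _ in range(num_stages + 1)]
    done[0] = set(range(len(tiles)))           # input acts as "stage 0", fully ready
    bufs = [np.asarray(data, dtype=float).copy()] + \
           [np.empty(n) for _ in range(num_stages)]
    ready = [(1, t) for t in range(len(tiles))]
    while ready:
        s, t = ready.pop(0)
        # Inter-stage dependency: a 3-point stencil on tile t needs the
        # neighboring tiles of the previous stage (its halo) to be finished.
        deps = {max(t - 1, 0), t, min(t + 1, len(tiles) - 1)}
        if not deps <= done[s - 1]:
            ready.append((s, t))               # dependencies not met; requeue
            continue
        lo, hi = tiles[t]
        plo, phi = max(lo - 1, 0), min(hi + 1, n)   # read region incl. halo
        out = blur3(bufs[s - 1][plo:phi])
        bufs[s][lo:hi] = out[lo - plo:(lo - plo) + (hi - lo)]
        done[s].add(t)
        if s < num_stages:                     # unlock the next stage for this tile
            ready.append((s + 1, t))
    return bufs[num_stages]
```

The point of the sketch is that correctness is independent of `tile_size`: any granularity yields the same result as running each stage to completion over the whole array, while finer tiles expose more overlap between stages — the trade-off GOPipe navigates automatically on a real GPU.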