research-article

GOPipe: A Granularity-Oblivious Programming Framework for Pipelined Stencil Executions on GPU

Published: 30 September 2020

ABSTRACT

Recent studies have shown promising performance benefits when multiple stages of a pipelined stencil application are mapped to different parts of a GPU to run concurrently. An important factor in the computing efficiency of such pipelines is task granularity. In previous programming frameworks that support true pipelined computations on GPU, programmers must choose this granularity at application development time. Due to many difficulties, programmers' decisions are often far from optimal, causing inferior performance and poor performance portability.

This paper presents GOPipe, a granularity-oblivious programming framework for efficient pipelined stencil executions on GPU. With GOPipe, programmers no longer need to specify the appropriate task granularity. GOPipe finds it automatically and dynamically schedules tasks of that granularity for efficiency, while observing all inter-task and inter-stage data dependencies. In our experiments on six real-life applications and various scenarios, GOPipe outperforms the state-of-the-art system by 1.39X on average while offering much better programming productivity.
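To make the notion of task granularity concrete, the following is a minimal sketch, not GOPipe's algorithm or API. It assumes a hypothetical two-stage pipeline over a 1D array in which the consumer stage applies a stencil of radius 1 to the producer's output, and each stage is split into tiles of `tile_size` elements. The function names, the 1D layout, and the radius are all assumptions made for illustration.

```python
def num_tiles(n, tile_size):
    """Number of tasks per stage when n elements are split into tiles."""
    return (n + tile_size - 1) // tile_size


def producer_tiles_needed(tile_idx, tile_size, n, radius=1):
    """Producer-stage tiles that one consumer-stage tile must wait for.

    The consumer tile covers elements [start, end]; a stencil of the given
    radius also reads `radius` elements on each side (clamped to the array).
    """
    start = tile_idx * tile_size
    end = min(start + tile_size, n) - 1      # inclusive last element
    lo = max(start - radius, 0)              # leftmost element read
    hi = min(end + radius, n - 1)            # rightmost element read
    return list(range(lo // tile_size, hi // tile_size + 1))


if __name__ == "__main__":
    n = 16
    for tile_size in (1, 4, 16):
        tiles = num_tiles(n, tile_size)
        deps = [producer_tiles_needed(t, tile_size, n) for t in range(tiles)]
        print(f"tile_size={tile_size}: {tiles} tasks per stage, deps={deps}")
```

In this toy model, a small tile size gives many tasks, each of which can start as soon as a few neighbouring producer tiles finish, at the cost of more dependency bookkeeping. A single tile per stage needs no bookkeeping but forces the consumer stage to wait for the whole producer stage, so the stages cannot overlap.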

Published in

PACT '20: Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques
September 2020
505 pages
ISBN: 9781450380751
DOI: 10.1145/3410463
General Chair: Vivek Sarkar
Program Chair: Hyesoon Kim

          Copyright © 2020 ACM


Publisher: Association for Computing Machinery, New York, NY, United States

Overall Acceptance Rate: 121 of 471 submissions, 26%
