DOI: 10.1145/3243176.3243196
research-article

Cost-driven thread coarsening for GPU kernels

Published: 01 November 2018

ABSTRACT

Directive-based programming models such as OpenACC provide a higher-level abstraction and a low-overhead approach to porting existing applications to GPGPUs and other heterogeneous HPC hardware. Such programming models widen the design space that the compiler can explore to exploit specific features of different architectures. We observed that traditional applications designed for latency-optimized, out-of-order pipelined CPUs do not efficiently exploit the throughput-optimized, in-order pipelined GPU architecture. In this paper, we develop a model to estimate the memory throughput of a given application. We then use the loop interleave transformation to improve the memory bandwidth utilization of a given kernel.
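The abstract does not spell out the transformation, so the following is only a minimal sketch of one common reading of loop interleaving (thread coarsening): each logical thread handles `factor` elements instead of one, strided by the thread count so that neighbouring threads still touch neighbouring addresses. The Python loop over `tid` stands in for GPU threads, and the function names, the element-wise addition, and the strided index mapping are all illustrative assumptions rather than the paper's implementation.

```python
def baseline_kernel(a, b):
    """One element per logical thread: thread tid computes out[tid]."""
    n = len(a)
    out = [0] * n
    for tid in range(n):  # each iteration models one GPU thread
        out[tid] = a[tid] + b[tid]
    return out


def interleaved_kernel(a, b, factor):
    """Coarsened variant: each logical thread computes `factor` elements.

    Assumed mapping: thread tid handles indices tid, tid + T, tid + 2T, ...
    where T is the reduced thread count. For a fixed k, adjacent threads
    access adjacent elements.
    """
    n = len(a)
    num_threads = (n + factor - 1) // factor  # ceil(n / factor)
    out = [0] * n
    for tid in range(num_threads):  # fewer threads than the baseline
        for k in range(factor):  # interleaved work per thread
            i = tid + k * num_threads
            if i < n:  # guard the tail when factor does not divide n
                out[i] = a[i] + b[i]
    return out


a = list(range(10))
b = [2 * x for x in a]
result = interleaved_kernel(a, b, factor=4)
```

With this mapping the coarsened kernel issues several independent loads per thread, which is the property a memory-throughput model would try to exploit; the best `factor` depends on the hardware and the kernel.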

We developed a heuristic to estimate the optimal loop interleave factor and implemented it in the OpenARC compiler for OpenACC. We evaluated our approach on over 216 kernels and achieved a geometric-mean speedup of 1.32×.
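For readers unfamiliar with the metric, a geometric-mean speedup is the n-th root of the product of per-kernel speedups. The sketch below shows one way to compute it; the function name and the timing values are hypothetical and are not taken from the paper's measurements.

```python
import math


def geometric_mean_speedup(baseline_times, optimized_times):
    """Geometric mean of per-kernel speedups (baseline / optimized)."""
    if len(baseline_times) != len(optimized_times) or not baseline_times:
        raise ValueError("need two non-empty timing lists of equal length")
    speedups = [b / o for b, o in zip(baseline_times, optimized_times)]
    # Averaging logarithms avoids overflow when multiplying many ratios.
    mean_log = sum(math.log(s) for s in speedups) / len(speedups)
    return math.exp(mean_log)


# Hypothetical timings in seconds for two kernels: speedups of 2x and 4x.
gm = geometric_mean_speedup([2.0, 8.0], [1.0, 2.0])
```

Unlike the arithmetic mean, the geometric mean is not dominated by a few kernels with very large speedups, which is why it is the usual summary for benchmark suites.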

Our compiler optimization aims to provide the right balance between performance, portability and productivity.

Published in

PACT '18: Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques
November 2018
494 pages
ISBN: 9781450359863
DOI: 10.1145/3243176

Copyright © 2018 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 121 of 471 submissions, 26%
