DOI: 10.1145/3229710.3229723
research-article

Scheduling Methods to Optimize Dependent Programs for GPU Architecture

Published: 13 August 2018

ABSTRACT

GPUs are now widely used in computation-intensive applications such as image processing, deep learning, and artificial intelligence. Since these applications can be modeled as collections of GPU kernels, some of which depend on one another, it is essential to find an efficient method for scheduling dependent kernels on GPU cores. Simply honoring kernel dependences by executing the kernels in sequence degrades performance. Furthermore, dependent kernels generally need to share data, so without proper scheduling, unnecessary memory accesses and copies are generated. This paper proposes an efficient method for scheduling dependent kernels on GPUs. Preliminary experimental results show that this technique improves performance by 43% on average when combined with appropriate memory write-back policies.
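The abstract does not describe the proposed scheduler itself, so the following is only a minimal sketch of the problem it targets, assuming plain CUDA streams and events and two hypothetical kernels (kernelA, kernelB) where the second consumes the first's output. Expressing the producer/consumer dependence explicitly lets the runtime order just the dependent pair, without serializing the whole device or forcing an extra round trip through host memory; this is the kind of fine-grained dependence a dependent-kernel scheduling method can exploit.

```cuda
// Illustrative sketch only (not the paper's scheduler): kernelB depends on
// kernelA's output. A CUDA event expresses the dependence so that stream sB
// waits only on kernelA, not on a device-wide synchronization.
#include <cuda_runtime.h>

__global__ void kernelA(float *buf, int n) {            // hypothetical producer
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = buf[i] * 2.0f;
}

__global__ void kernelB(const float *in, float *out, int n) {  // hypothetical consumer
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *buf, *out;
    cudaMalloc(&buf, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    cudaStream_t sA, sB;
    cudaStreamCreate(&sA);
    cudaStreamCreate(&sB);
    cudaEvent_t aDone;
    cudaEventCreate(&aDone);

    dim3 block(256), grid((n + block.x - 1) / block.x);

    kernelA<<<grid, block, 0, sA>>>(buf, n);     // producer on stream sA
    cudaEventRecord(aDone, sA);                  // mark completion of kernelA
    cudaStreamWaitEvent(sB, aDone, 0);           // sB waits only on kernelA
    kernelB<<<grid, block, 0, sB>>>(buf, out, n); // consumer reads shared buffer directly

    cudaStreamSynchronize(sB);
    cudaFree(buf); cudaFree(out);
    cudaStreamDestroy(sA); cudaStreamDestroy(sB);
    cudaEventDestroy(aDone);
    return 0;
}
```

Even with the dependence expressed this way, the cost of the shared buffer still depends on when and how its cached data is written back, which is consistent with the abstract's observation that the scheduling method pays off most when combined with appropriate memory write-back policies.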


• Published in

  ICPP Workshops '18: Workshop Proceedings of the 47th International Conference on Parallel Processing
  August 2018, 409 pages
  ISBN: 9781450365239
  DOI: 10.1145/3229710

      Copyright © 2018 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 13 August 2018


      Qualifiers

      • research-article
      • Research
      • Refereed limited

      Acceptance Rates

Overall acceptance rate: 91 of 313 submissions, 29%
