ABSTRACT
GPUs are now widely used in computation-intensive applications such as image processing, deep learning, and artificial intelligence. Because these applications can be modeled as multiple GPU kernels, some of which are dependent on one another, an efficient method for scheduling dependent kernels on GPU cores is essential. Naively enforcing kernel dependences by executing the kernels in sequence degrades performance. Moreover, dependent kernels generally share data, so without proper scheduling they incur unnecessary memory accesses and copies. This paper proposes an efficient method for scheduling dependent kernels on GPUs. Preliminary experimental results show that the technique improves performance by 43% on average when combined with appropriate memory write-back policies.
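The paper's scheduler itself is not reproduced here; as a point of reference, the minimal CUDA sketch below illustrates the baseline idea the abstract alludes to: expressing the dependence between two kernels through stream ordering, so the shared intermediate buffer stays resident in device memory instead of being synchronized and copied back to the host between kernels. The kernel names (`scale`, `offset`), problem size, and launch configuration are illustrative assumptions, not the paper's code.

```
// Sketch: two dependent kernels chained on one CUDA stream.
// Stream ordering preserves the producer/consumer dependence, so the
// intermediate data (d_buf) never leaves device memory between kernels.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void scale(float *buf, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= factor;          // producer: writes shared data
}

__global__ void offset(float *buf, int n, float delta) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += delta;           // consumer: reads producer's output
}

int main() {
    const int n = 1 << 20;                // illustrative problem size
    const size_t bytes = n * sizeof(float);

    float *h_buf = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h_buf[i] = 1.0f;

    float *d_buf;
    cudaMalloc(&d_buf, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // One copy in, two dependent kernels, one copy out. Work issued to
    // the same stream executes in issue order, so the dependence is
    // honored without a host-side sync or an intermediate host copy.
    cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
    int threads = 256, blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads, 0, stream>>>(d_buf, n, 2.0f);
    offset<<<blocks, threads, 0, stream>>>(d_buf, n, 1.0f);
    cudaMemcpyAsync(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    printf("h_buf[0] = %f (expected 3.0)\n", h_buf[0]);

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    free(h_buf);
    return 0;
}
```

Even this baseline avoids the redundant copies the abstract warns about; the paper's contribution goes further by scheduling such dependent kernels across GPU cores and pairing the schedule with memory write-back policies.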
Index Terms
- Scheduling Methods to Optimize Dependent Programs for GPU Architecture