ABSTRACT
Programming image processing algorithms on hardware accelerators such as graphics processing units (GPUs) often involves a trade-off between software portability and performance portability. Domain-specific languages (DSLs) have proven to be a promising remedy: they enable optimizations and the generation of efficient code from a concise, high-level algorithm description.
The scope of this paper is an optimization framework for image processing DSLs in the form of a source-to-source compiler. To mitigate the inter-kernel communication bottleneck that global memory imposes on GPU applications, kernel fusion is investigated as the primary optimization technique for improving temporal locality. To enable automatic kernel fusion, we analyze the fusibility of each kernel in the algorithm in terms of data dependencies, resource utilization, and parallelism granularity. Combining this analysis with the domain-specific knowledge captured in the DSL, we propose a method to automatically fuse suitable kernels and integrate it into an open-source DSL framework. The novel kernel fusion technique is evaluated on two filter-based image processing applications, achieving speedups of up to 1.60 on an NVIDIA GeForce 745 graphics card.
Index Terms
- Automatic Kernel Fusion for Image Processing DSLs