ABSTRACT
Programming image processing algorithms on hardware accelerators such as graphics processing units (GPUs) often involves a trade-off between software portability and performance portability. Domain-specific languages (DSLs) have proven to be a promising remedy: they enable optimizations and the generation of efficient code from a concise, high-level algorithm description.
The scope of this paper is an optimization framework for image processing DSLs in the form of a source-to-source compiler. To mitigate the inter-kernel communication bottleneck that global memory imposes on GPU applications, kernel fusion is investigated as the primary optimization technique for improving temporal locality. To enable automatic kernel fusion, we analyze the fusibility of each kernel in the algorithm in terms of data dependencies, resource utilization, and parallelism granularity. Combining this analysis with the domain-specific knowledge captured in the DSL, we propose a method to automatically fuse suitable kernels and integrate it into an open-source DSL framework. The novel kernel fusion technique is evaluated on two filter-based image processing applications, achieving speedups of up to 1.60 on an NVIDIA GeForce 745 graphics card.
Index Terms
- Automatic Kernel Fusion for Image Processing DSLs