DOI: 10.1145/3207719.3207723
research-article

Automatic Kernel Fusion for Image Processing DSLs

Published: 28 May 2018

ABSTRACT

Programming image processing algorithms on hardware accelerators such as graphics processing units (GPUs) often involves a trade-off between software portability and performance portability. Domain-specific languages (DSLs) have proven to be a promising remedy: they enable optimizations and the generation of efficient code from a concise, high-level algorithm representation.

The scope of this paper is an optimization framework for image processing DSLs in the form of a source-to-source compiler. In GPU applications, communication between kernels is bound by global memory; to cope with this, kernel fusion is investigated as a primary optimization technique to improve temporal locality. To enable automatic kernel fusion, we analyze the fusibility of each kernel in the algorithm in terms of data dependencies, resource utilization, and parallelism granularity. By combining the obtained information with the domain-specific knowledge captured in the DSL, a method to automatically fuse suitable kernels is proposed and integrated into an open source DSL framework. The novel kernel fusion technique is evaluated on two filter-based image processing applications, for which speedups of up to 1.60 are obtained on an NVIDIA GeForce 745 graphics card target.
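The idea behind kernel fusion can be illustrated with a minimal sketch. It is not the paper's method or the DSL framework's API: the names `run_kernel`, `unfused`, `fuse`, `fused`, `scale`, and `offset` are hypothetical, the "kernels" are plain Python functions over nested lists, and only the simplest case (two point operators in a producer-consumer chain) is covered. The fusibility analysis described in the abstract is not modeled here.

```python
from typing import Callable, List

Image = List[List[float]]
PointOp = Callable[[float], float]


def run_kernel(op: PointOp, src: Image) -> Image:
    """Stand-in for one kernel launch: read every pixel, write a full image."""
    return [[op(p) for p in row] for row in src]


def unfused(src: Image, first: PointOp, second: PointOp) -> Image:
    """Two launches; `tmp` plays the role of the intermediate image that a
    GPU pipeline would write to and read back from global memory."""
    tmp = run_kernel(first, src)
    return run_kernel(second, tmp)


def fuse(first: PointOp, second: PointOp) -> PointOp:
    """Compose two point operators so the intermediate value stays local
    to the per-pixel computation (assumption: both are point operators)."""
    return lambda p: second(first(p))


def fused(src: Image, first: PointOp, second: PointOp) -> Image:
    """One launch; no intermediate image is materialized."""
    return run_kernel(fuse(first, second), src)


def scale(p: float) -> float:
    return 2.0 * p


def offset(p: float) -> float:
    return p + 1.0


image = [[0.0, 1.0], [2.0, 3.0]]
result_unfused = unfused(image, scale, offset)
result_fused = fused(image, scale, offset)
print(result_fused)  # [[1.0, 3.0], [5.0, 7.0]]
```

Both variants compute the same result; the fused one avoids the full-size intermediate image. Fusing operators that read a neighborhood of pixels is harder because the consumer needs several producer outputs per pixel, and the sketch does not address that case.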


        • Published in

          SCOPES '18: Proceedings of the 21st International Workshop on Software and Compilers for Embedded Systems
          May 2018
          120 pages
          ISBN: 9781450357807
          DOI: 10.1145/3207719

          Copyright © 2018 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Qualifiers

          • research-article
          • Research
          • Refereed limited

          Acceptance Rates

          Overall Acceptance Rate: 38 of 79 submissions, 48%
