research-article

Unveiling kernel concurrency in multiresolution filters on GPUs with an image processing DSL

Authors:

Frank Hannig

GPGPU '20: Proceedings of the 13th Annual Workshop on General Purpose Processing using Graphics Processing Unit

Pages 11 - 20

https://doi.org/10.1145/3366428.3380773

Published: 23 February 2020

Abstract

Multiresolution filters, which analyze information at different scales, are crucial for many applications in digital image processing. The differing space and time complexity at the distinct scales of the pyramidal structure poses both a challenge and an opportunity for implementations on modern accelerators such as GPUs, which have an increasing number of compute units. In this paper, we exploit the potential of concurrent kernel execution in multiresolution filters. As a major contribution, we present a model-based approach for the performance analysis of both single- and multi-stream implementations, combining application- and architecture-specific knowledge. As a second contribution, the involved transformations and code generators using CUDA streams on Nvidia GPUs have been integrated into a compiler-based approach using an image processing DSL called Hipacc. We then apply our approach to evaluate and compare the achieved performance for four real-world applications on three GPUs. The results show that our method can achieve a geometric mean speedup of up to 2.5 over the original Hipacc implementation without our approach, up to 2.0 over the state-of-the-art DSL Halide, and up to 1.3 over CUDA Graph, a recently released programming model from Nvidia.
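To make the opportunity mentioned in the abstract more concrete, the following Python sketch estimates how many thread blocks a per-pixel kernel would launch at each pyramid level. It is only an illustration under stated assumptions and is not the performance model from the paper: the dyadic halving of each level, the 32x4 block shape, the 2048x2048 input, and the device capacity `resident_blocks` are all hypothetical values chosen for the example.

```python
def ceil_div(a, b):
    """Integer division rounded up."""
    return (a + b - 1) // b


def pyramid_levels(width, height, depth):
    """Sizes of an image pyramid, finest level first.

    Assumption: each coarser level halves both dimensions (rounding up).
    """
    levels = []
    w, h = width, height
    for _ in range(depth):
        levels.append((w, h))
        w, h = ceil_div(w, 2), ceil_div(h, 2)
    return levels


def thread_blocks(w, h, block_w=32, block_h=4):
    """Thread blocks needed to cover a w x h image with one thread per pixel.

    The 32x4 block shape is an illustrative default, not a value from the paper.
    """
    return ceil_div(w, block_w) * ceil_div(h, block_h)


def underfilled_levels(levels, resident_blocks):
    """Indices of levels whose kernel launches fewer blocks than the
    (hypothetical) number of blocks the device can keep resident at once."""
    return [i for i, (w, h) in enumerate(levels)
            if thread_blocks(w, h) < resident_blocks]


if __name__ == "__main__":
    levels = pyramid_levels(2048, 2048, depth=6)   # hypothetical input size
    resident_blocks = 640                          # hypothetical device capacity
    for i, (w, h) in enumerate(levels):
        print(f"level {i}: {w}x{h} pixels, {thread_blocks(w, h)} blocks")
    print("underfilled levels:", underfilled_levels(levels, resident_blocks))
```

With these assumed values, the three coarsest levels launch 512, 128, and 32 blocks, which is fewer than the assumed 640 resident blocks. Kernels of that size cannot occupy the whole device on their own, which is the kind of situation in which running independent kernels concurrently, for example in separate CUDA streams, may help.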

Information & Contributors

Information

Published In

GPGPU '20: Proceedings of the 13th Annual Workshop on General Purpose Processing using Graphics Processing Unit

February 2020

77 pages

ISBN: 9781450370257

DOI: 10.1145/3366428

Conference Chairs: Adwait Jog, Onur Kayiran, Ashutosh Pattnaik

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

Siemens Healthineers AG, Erlangen, Germany

Conference

PPoPP '20: 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

February 23, 2020

San Diego, California

Acceptance Rates

GPGPU '20 paper acceptance rate: 7 of 12 submissions (58%)

Overall acceptance rate: 57 of 129 submissions (44%)

Cited By

Gong S, Altinbüken D, Fonseca P, Maniatis P (2021). Snowboard. Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles, pp. 66-83. DOI: 10.1145/3477132.3483549. Online publication date: 26-Oct-2021.
https://dl.acm.org/doi/10.1145/3477132.3483549

Qiao B, Teich J, Hannig F (2021). An Efficient Approach for Image Border Handling on GPUs via Iteration Space Partitioning. 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 387-396. DOI: 10.1109/IPDPSW52791.2021.00067. Online publication date: Jun-2021.
https://doi.org/10.1109/IPDPSW52791.2021.00067

Parravicini A, Delamare A, Arnaboldi M, Santambrogio M (2021). DAG-based Scheduling with Resource Sharing for Multi-task Applications in a Polyglot GPU Runtime. 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 111-120. DOI: 10.1109/IPDPS49936.2021.00020. Online publication date: May-2021.
https://doi.org/10.1109/IPDPS49936.2021.00020

Qiao B, Reiche O, Özkan M, Teich J, Hannig F, Corporaal H (2020). Efficient parallel reduction on GPUs with Hipacc. Proceedings of the 23th International Workshop on Software and Compilers for Embedded Systems, pp. 58-61. DOI: 10.1145/3378678.3391885. Online publication date: 25-May-2020.
https://dl.acm.org/doi/10.1145/3378678.3391885
