Abstract
An out-of-core stencil code processes data that exceed the memory capacity of a GPU. However, such a code requires frequent data transfers between the CPU and GPU, which often dominate overall performance. In this work, we propose a compression-based, memory-efficient method to accelerate out-of-core stencil codes. First, an on-the-fly compression technique is integrated into the out-of-core computation to reduce CPU-GPU data transfers. Second, a single-working-buffer strategy is employed to reduce GPU memory usage, allowing more data to be kept on the GPU for reuse and thereby increasing the number of temporal blocking steps. Experimental results demonstrate that the proposed method reduces GPU memory usage by 21%, creating space to double the number of temporal blocking steps compared with codes without compression. The proposed method helps high-order, data-transfer-bound stencil codes achieve speedups of up to \(2.09\times \) in single-precision and up to \(1.92\times \) in double-precision floating-point format on an NVIDIA Tesla V100 GPU, in comparison with codes without compression.
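The core idea of the abstract can be illustrated with a minimal, hypothetical sketch. Here Python's lossless `zlib` codec stands in for the paper's on-the-fly compressor, and the CPU-GPU transfer is simulated on the host: each block of the domain is compressed before the "transfer" and decompressed on arrival, so far fewer bytes cross the link, and several temporal steps are applied while the block is resident. All names and the 1D 3-point stencil are illustrative, not the authors' implementation.

```python
import struct
import zlib

def stencil_step(u):
    """One 3-point averaging sweep over a 1D field (boundaries held fixed)."""
    return [u[0]] + [(u[i - 1] + u[i] + u[i + 1]) / 3.0
                     for i in range(1, len(u) - 1)] + [u[-1]]

def compress_block(block):
    """Serialize a block of doubles and compress it before the transfer."""
    return zlib.compress(struct.pack("%dd" % len(block), *block))

def decompress_block(payload, n):
    """Decompress a received payload back into a list of n doubles."""
    return list(struct.unpack("%dd" % n, zlib.decompress(payload)))

# A smooth, piecewise-constant field stands in for simulation data.
n = 4096
field = [float(i // 256) for i in range(n)]

# "Host-to-device transfer": move the compressed payload, not the raw bytes.
payload = compress_block(field)
restored = decompress_block(payload, n)
assert restored == field          # lossless codec: bit-exact round trip
assert len(payload) < 8 * n       # fewer bytes cross the CPU-GPU link

# "Device-side" temporal blocking: several steps run on the resident block
# before it would be compressed again and written back.
for _ in range(4):
    restored = stencil_step(restored)
```

The design point the sketch mirrors is that compression trades cheap compute for expensive transfer bandwidth, and the memory freed by smaller resident buffers can be reinvested in deeper temporal blocking.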
Data Availability
Source codes of this work are available at https://github.com/mlvssc/compstencil.
Acknowledgements
This study was supported in part by Japan Society for the Promotion of Science KAKENHI grant 20K21794, National Natural Science Foundation of China grants 61902045 and 62172067, and Chongqing High-Tech Research Key Program grants cstc2021jcyj-msxmX0981, cstc2021jcyj-msxmX0530, and cstb2022nscq-msx0601.
Funding
Japan Society for the Promotion of Science KAKENHI grant 20K21794; National Natural Science Foundation of China grants 61902045 and 62172067; Chongqing High-Tech Research Key Program grants cstc2021jcyj-msxmX0981, cstc2021jcyj-msxmX0530, and cstb2022nscq-msx0601.
Author information
Contributions
JS conducted the experiments and wrote the manuscript. LL contributed substantially to the evaluation and to revising the manuscript. FI partially supervised this study. XD and MO provided revision suggestions. All authors reviewed the manuscript.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Shen, J., Long, L., Deng, X. et al. A compression-based memory-efficient optimization for out-of-core GPU stencil computation. J Supercomput 79, 11055–11077 (2023). https://doi.org/10.1007/s11227-023-05103-8