
A compression-based memory-efficient optimization for out-of-core GPU stencil computation


Abstract

Out-of-core stencil codes process data that exceed the memory capacity of a GPU, but doing so requires frequent data transfers between the CPU and the GPU, and these transfers often dominate overall performance. In this work, we propose a compression-based, memory-efficient method to accelerate out-of-core stencil codes. First, an on-the-fly compression technique is integrated into the out-of-core computation to reduce the volume of CPU-GPU data transfers. Second, a single-working-buffer strategy reduces GPU memory usage, allowing more data to be kept on the GPU for reuse and thus more temporal blocking steps. Experimental results show that the proposed method reduces GPU memory usage by 21%, creating space to double the number of temporal blocking steps relative to codes without compression. On an NVIDIA Tesla V100 GPU, the method accelerates high-order, data-transfer-bound stencil codes by up to \(2.09\times \) in single-precision floating-point format and up to \(1.92\times \) in double-precision floating-point format compared with codes without compression.
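The full pipeline and the compressor used in the paper are in the repository linked under Data Availability. Purely as an illustration of the general pattern, and not the authors' implementation, the CUDA sketch below compresses a chunk on the GPU before the device-to-host copy and decompresses it after the host-to-device copy, so only the compressed bytes cross the PCIe bus. A toy float-to-half truncation stands in for the real compressor, and the buffer names, chunk size, and 2:1 ratio are assumptions made for this sketch.

// Hypothetical sketch, not the authors' implementation: on-the-fly
// compression around CPU-GPU transfers in an out-of-core pipeline.
// A float-to-half truncation stands in for the real compressor,
// halving the bytes that cross PCIe in each direction.
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cstdio>

__global__ void compress(const float* in, __half* out, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = __float2half(in[i]);   // lossy 2:1 "compression"
}

__global__ void decompress(const __half* in, float* out, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = __half2float(in[i]);
}

int main() {
    const size_t n = 1 << 24;    // one out-of-core chunk (16M cells, assumed)
    float*  d_work;              // single working buffer, reused per chunk
    __half* d_comp;              // compressed staging buffer on the GPU
    __half* h_comp;              // pinned host buffer holds compressed data
    cudaMalloc(&d_work, n * sizeof(float));
    cudaMalloc(&d_comp, n * sizeof(__half));
    cudaMallocHost(&h_comp, n * sizeof(__half));

    cudaStream_t s;
    cudaStreamCreate(&s);
    const unsigned threads = 256;
    const unsigned blocks  = (unsigned)((n + threads - 1) / threads);

    // Evict a chunk: compress on the GPU, then move only half the bytes out.
    compress<<<blocks, threads, 0, s>>>(d_work, d_comp, n);
    cudaMemcpyAsync(h_comp, d_comp, n * sizeof(__half),
                    cudaMemcpyDeviceToHost, s);

    // Reload a chunk: move the compressed bytes in, then decompress on GPU.
    cudaMemcpyAsync(d_comp, h_comp, n * sizeof(__half),
                    cudaMemcpyHostToDevice, s);
    decompress<<<blocks, threads, 0, s>>>(d_comp, d_work, n);

    cudaStreamSynchronize(s);
    printf("chunk round trip with 2:1 compression done\n");

    cudaFree(d_work); cudaFree(d_comp); cudaFreeHost(h_comp);
    cudaStreamDestroy(s);
    return 0;
}

In the paper's design, the GPU memory saved by staging compressed data and by reusing a single working buffer (d_work above plays that role here) is what lets more subdomain data stay resident, enabling the additional temporal blocking steps reported in the abstract.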



Data Availability

The source code for this work is available at https://github.com/mlvssc/compstencil.


Acknowledgements

This study was supported in part by the Japan Society for the Promotion of Science (JSPS) KAKENHI grant 20K21794; National Natural Science Foundation of China grants 61902045 and 62172067; and Chongqing High-Tech Research Key Program grants cstc2021jcyj-msxmX0981, cstc2021jcyj-msxmX0530, and cstb2022nscq-msx0601.

Funding

KAKENHI grant 20K21794 from the Japan Society for the Promotion of Science; National Natural Science Foundation of China grants 61902045 and 62172067; and Chongqing High-Tech Research Key Program grants cstc2021jcyj-msxmX0981, cstc2021jcyj-msxmX0530, and cstb2022nscq-msx0601.

Author information


Contributions

JS conducted the experiments and wrote the manuscript. LL contributed substantially to evaluating the work and to revising the manuscript. FI partially supervised the study. XD and MO provided suggestions for revision. All authors reviewed the manuscript.

Corresponding author

Correspondence to Linbo Long.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Shen, J., Long, L., Deng, X. et al. A compression-based memory-efficient optimization for out-of-core GPU stencil computation. J Supercomput 79, 11055–11077 (2023). https://doi.org/10.1007/s11227-023-05103-8

