ABSTRACT
Lossy compression is often deployed in scientific applications to reduce data footprint and to improve data transfer and I/O performance. For applications that require on-the-fly compression in particular, minimizing compression runtime is essential. In this paper, we design a scheme to improve the performance of cuSZ, a GPU-based lossy compressor. We observe that Huffman coding, which cuSZ uses to compress metadata generated during compression, incurs a performance overhead that can be significant, especially for smaller datasets. Our work seeks to reduce the Huffman coding runtime with minimal-to-no impact on cuSZ's compression efficiency.
Our contributions are as follows. First, we examine a variety of probability distributions to determine which of them closely model the input to cuSZ's Huffman coding stage. From these distributions, we create a dictionary of pre-computed codebooks so that, during compression, a codebook is selected from the dictionary instead of being computed for each input. Second, we explore three codebook selection criteria to be applied at runtime. Finally, we evaluate our scheme on real-world datasets and in the context of two important application use cases, HDF5 and MPI, using an NVIDIA A100 GPU. Our evaluation shows that our method can reduce the Huffman coding penalty by 78–92×, translating to a total speedup of up to 5× over baseline cuSZ. Smaller HDF5 chunk sizes see a compression speedup of over 8×, and MPI messages on the scale of tens of MB see a 1.4–30.5× speedup in communication time.
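The idea of a pre-computed codebook dictionary can be illustrated with a minimal CPU-side sketch. Everything here is an assumption made for illustration rather than the scheme evaluated in the paper: the function names are hypothetical, a discretized Laplace family stands in for the candidate distributions, and the selection rule (smallest estimated encoded size given a symbol histogram) is just one plausible criterion, not necessarily one of the three explored in this work.

```python
import heapq
import math
from itertools import count


def huffman_code_lengths(weights):
    """Return the Huffman code length in bits for each symbol index."""
    n = len(weights)
    if n == 1:
        return [1]
    tie = count()  # tie-breaker so the heap never compares symbol lists
    heap = [(w, next(tie), [s]) for s, w in enumerate(weights)]
    heapq.heapify(heap)
    lengths = [0] * n
    while len(heap) > 1:
        w1, _, syms1 = heapq.heappop(heap)
        w2, _, syms2 = heapq.heappop(heap)
        # Every symbol under the merged node moves one level deeper.
        for s in syms1 + syms2:
            lengths[s] += 1
        heapq.heappush(heap, (w1 + w2, next(tie), syms1 + syms2))
    return lengths


def laplace_pmf(num_symbols, scale):
    """Discretized Laplace distribution centered on the middle symbol
    (an assumed model family, chosen only for this sketch)."""
    center = num_symbols // 2
    raw = [math.exp(-abs(s - center) / scale) for s in range(num_symbols)]
    total = sum(raw)
    return [r / total for r in raw]


def build_dictionary(num_symbols, scales):
    """Offline step: one pre-computed codebook (code lengths) per scale."""
    return {
        scale: huffman_code_lengths(laplace_pmf(num_symbols, scale))
        for scale in scales
    }


def select_codebook(histogram, dictionary):
    """Runtime step: pick the key whose codebook gives the fewest bits.
    (Hypothetical criterion: estimated encoded size from the histogram.)"""
    def encoded_bits(lengths):
        return sum(c * l for c, l in zip(histogram, lengths))
    return min(dictionary, key=lambda key: encoded_bits(dictionary[key]))


if __name__ == "__main__":
    codebooks = build_dictionary(16, [0.5, 2.0, 8.0])
    peaked = [0] * 16
    peaked[8], peaked[7], peaked[9] = 1000, 50, 50
    print("peaked histogram ->", select_codebook(peaked, codebooks))
    print("flat histogram   ->", select_codebook([100] * 16, codebooks))
```

In this sketch the expensive tree construction happens once, offline, and the runtime cost reduces to one pass over the histogram per candidate codebook. The trade-off is that a pre-computed codebook is generally not optimal for a given input, so some compression ratio may be lost.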
Index Terms
- Lightweight Huffman Coding for Efficient GPU Compression