ABSTRACT
Huffman coding is a fundamental lossless data compression scheme used in many compression file formats such as gzip, zip, PNG, and JPEG. Huffman encoding is easily parallelized, because all 8-bit symbols can be converted into codewords independently. In contrast, an encoded codeword sequence contains no separators identifying the individual codewords, so parallelizing Huffman decoding is a much harder task. This work presents a new data structure called the gap array, attached to the encoded codeword sequence of Huffman coding, for accelerating parallel Huffman decoding. In addition, it shows that GPU Huffman encoding and decoding can be accelerated by several techniques: (1) Single Kernel Soft Synchronization (SKSS), (2) wordwise global memory access, and (3) compact codebooks. Experimental results for 10 files on an NVIDIA Tesla V100 GPU show that our GPU Huffman encoding and decoding run 2.87–7.70 times and 1.26–2.63 times faster, respectively, than previously presented GPU Huffman encoding and decoding. Huffman decoding can be further accelerated by a factor of 1.67–6450 if a gap array is attached to the encoded codeword sequence. Since the size and computing overhead of gap arrays in Huffman encoding are small, we conclude that gap arrays should be introduced into GPU Huffman encoding and decoding.
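To make the gap-array idea concrete, here is a minimal CPU-side sketch in Python. It assumes one plausible reading of the abstract: the gap array stores, for each fixed-width block of the encoded bit stream, the offset from the block's start to the first codeword boundary inside it, so each block can then be decoded independently. The names `SEGMENT_BITS`, `encode_with_gaps`, and `decode_segment` are our own illustration, not the paper's API, and the paper's actual CUDA implementation differs.

```python
import bisect

SEGMENT_BITS = 8  # deliberately small so a toy input spans many segments

def encode_with_gaps(data, codebook):
    """Concatenate codewords; for each SEGMENT_BITS-wide block of the
    bit stream, record the gap: the offset from the block's start to the
    first codeword boundary at or after it."""
    stream = "".join(codebook[s] for s in data)
    boundaries, pos = [0], 0          # bit positions where codewords start/end
    for s in data:
        pos += len(codebook[s])
        boundaries.append(pos)
    gaps = []
    for seg in range(0, len(stream), SEGMENT_BITS):
        b = boundaries[bisect.bisect_left(boundaries, seg)]
        gaps.append(b - seg)
    return stream, gaps

def decode_segment(stream, gaps, inv, k):
    """Decode every codeword that *starts* inside segment k; a codeword
    may spill into segment k+1, so reading continues past the boundary."""
    i = k * SEGMENT_BITS + gaps[k]
    limit = min((k + 1) * SEGMENT_BITS, len(stream))
    out = []
    while i < limit:
        buf = ""
        while buf not in inv:         # prefix-free code: first hit is the match
            buf += stream[i]
            i += 1
        out.append(inv[buf])
    return out
```

Because each `decode_segment` call depends only on the stream, the gap array, and the codebook, the calls are mutually independent; on a GPU each thread (or thread block) would decode one segment, which is what makes the decoder parallel despite the stream having no separators.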