DOI: 10.1145/3404397.3404429 · ICPP Conference Proceedings · Research article

Huffman Coding with Gap Arrays for GPU Acceleration

Published: 17 August 2020

ABSTRACT

Huffman coding is a fundamental lossless data compression scheme used in many compressed file formats such as gzip, zip, png, and jpeg. Huffman encoding is easily parallelized, because all 8-bit symbols can be converted into codewords independently. On the other hand, since an encoded codeword sequence contains no separators identifying individual codewords, parallelizing Huffman decoding is a much harder task. This work presents a new data structure called a gap array, attached to the encoded codeword sequence, to accelerate parallel Huffman decoding. In addition, it shows that GPU Huffman encoding and decoding can be accelerated by several techniques: (1) Single Kernel Soft Synchronization (SKSS), (2) wordwise global memory access, and (3) compact codebooks. Experimental results for 10 files on an NVIDIA Tesla V100 GPU show that our GPU Huffman encoding and decoding run 2.87-7.70 times and 1.26-2.63 times faster, respectively, than previously presented GPU Huffman encoding and decoding. Moreover, Huffman decoding is further accelerated by a factor of 1.67-6450 when a gap array is attached to the encoded codeword sequence. Since the size and computational overhead of gap arrays during encoding are small, we conclude that gap arrays should be adopted for GPU Huffman encoding and decoding.
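The abstract's core idea can be illustrated on the CPU. The sketch below (Python, not the paper's CUDA implementation; codebook construction, segment size, and all function names are illustrative assumptions) splits the encoded bitstream into fixed-size segments and records, for each segment, a "gap": the number of leading bits that belong to a codeword started in the previous segment. With the gap array, each segment can be decoded independently, which is what enables parallel decoding.

```python
# Minimal CPU sketch of gap-array-assisted Huffman decoding.
# Assumes every codeword is shorter than one segment.
import heapq
from collections import Counter

def build_codebook(data):
    """Huffman codebook via the classic heap construction: symbol -> bit string."""
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(Counter(data).items())]
    heapq.heapify(heap)
    tick = len(heap)  # unique tie-breaker so dicts are never compared
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, tick, merged))
        tick += 1
    return heap[0][2]

def encode_with_gaps(data, book, seg_bits=64):
    """Encode and record, per segment, how far a straddling codeword spills into it."""
    bits = "".join(book[s] for s in data)
    gaps, pos = [], 0
    for s in data:
        end = pos + len(book[s])
        if end // seg_bits > pos // seg_bits:      # codeword touches a segment boundary
            gaps.extend([0] * (end // seg_bits - len(gaps) - 1))
            gaps.append(end % seg_bits)            # spill into the new segment (0 if exact fit)
        pos = end
    return bits, [0] + gaps                        # segment 0 starts on a codeword boundary

def decode_segment(bits, gaps, i, inv, seg_bits=64):
    """Decode the codewords that *start* inside segment i, independently of other segments."""
    pos = i * seg_bits + gaps[i]                   # skip the spilled-in prefix
    end = min((i + 1) * seg_bits, len(bits))
    out, cur = [], ""
    while pos < end or (cur and pos < len(bits)):  # finish a codeword that spills out
        cur += bits[pos]
        pos += 1
        if cur in inv:                             # prefix-free, so greedy match is correct
            out.append(inv[cur])
            cur = ""
            if pos >= end:
                break
    return out
```

In the actual paper each GPU thread would own one segment; here a sequential loop over segments plays that role, and concatenating the per-segment outputs reproduces the input exactly.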


Published in
ICPP '20: Proceedings of the 49th International Conference on Parallel Processing
August 2020, 844 pages
ISBN: 9781450388160
DOI: 10.1145/3404397
    Copyright © 2020 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher
Association for Computing Machinery, New York, NY, United States


    Qualifiers

    • research-article
    • Research
    • Refereed limited

Acceptance Rates

Overall acceptance rate: 91 of 313 submissions, 29%
