
Memory Access Optimized Implementation of Cyclic and Quasi-Cyclic LDPC Codes on a GPGPU

Published in: Journal of Signal Processing Systems

Abstract

Software-based decoding of low-density parity-check (LDPC) codes is frequently very time consuming, so general-purpose graphics processing units (GPGPUs) that support massively parallel processing can be very useful for speeding up simulations. In LDPC decoding, the parity-check matrix H needs to be accessed at every node-updating step, and the matrix is often larger than the GPU on-chip memory, especially when the code length is long or the weight is high. In this work, the parity-check matrix of cyclic or quasi-cyclic (QC) LDPC codes is greatly compressed by exploiting the periodic structure of the matrix. In addition, vacant elements are eliminated from the sparse message arrays in order to exploit the coalesced global-memory access supported by GPGPUs. Regular projective-geometry (PG) and irregular QC LDPC codes are decoded with the sum-product algorithm on an NVIDIA GTX-285 graphics processing unit (GPU), and considerable speed-ups are obtained.


[Figures 1–13 appear in the full article.]


Notes

  1. The segment size is 32 bytes for 8-bit data, 64 bytes for 16-bit data, and 128 bytes for 32-, 64-, and 128-bit data.

  2. The compute capability of a device is defined by a major and a minor revision number. Devices with the same major revision number share the same core architecture. The minor revision number corresponds to an incremental improvement of the core architecture, possibly including new features. The compute capability of the GTX-200 series is 1.3.

  3. Block dimension is the number of threads that constitute one thread block.

  4. The maximum number of threads per thread block is 512.

  5. The index calculation is described in Section 3.3 in detail.


Acknowledgements

This work was supported in part by the National Research Foundation (NRF) grant funded by the Korea government (MEST) (No. 20090075770 and No. 20090084804) and in part by the MEST under the Brain Korea 21 Project.

Author information

Correspondence to Hyunwoo Ji.

Additional information

This work is an improved version of "Massively parallel implementation of cyclic LDPC codes on a general purpose graphic processing unit," presented at the IEEE Workshop on Signal Processing Systems (SiPS) held in Tampere, Finland, in 2009. Implementation results for the standardized irregular QC LDPC codes of Wi-Fi and WiMAX are added, and a two-dimensional message-array compression technique is included.


Cite this article

Ji, H., Cho, J. & Sung, W. Memory Access Optimized Implementation of Cyclic and Quasi-Cyclic LDPC Codes on a GPGPU. J Sign Process Syst 64, 149–159 (2011). https://doi.org/10.1007/s11265-010-0547-9
