Optimising lossless stages in a GPU-based MPEG encoder

Multimedia Tools and Applications

Abstract

Modern GPUs excel at parallel computation, which makes them an attractive target for matrix transformations such as the DCT, a fundamental part of MPEG video coding algorithms. The approach is even more appealing in a system that encodes synthetic video (e.g., computer-generated frames), because the images to encode already reside in GPU memory, which eliminates the cost of transferring raw video from the CPU to the GPU. However, after the GPU has transformed and quantized a raw frame, the resulting coefficients must be reordered, entropy encoded and framed into the output MPEG bitstream. These last steps are essentially sequential, and a straightforward GPU implementation of them is inefficient compared with CPU-based implementations. We present different approaches to implementing part of these steps on the GPU, aiming for better use of the memory bus, so that the gains in transfer time compensate for the suboptimal use of the GPU. We analyze three approaches to performing the zigzag scan and Huffman coding with a combination of GPU and CPU, and two approaches to assembling the results into the actual output bitstream, in both GPU and CPU memory. Our experiments show that reducing the amount of data transferred from GPU to CPU, by implementing the last sequential compression steps on the GPU and using a parallel fast scan implementation of the zigzag scan, improves the overall performance of the system: the savings in transfer time outweigh the extra cost incurred on the GPU.
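The lossless stages named in the abstract can be illustrated with a small CPU-side sketch. It is not the authors' implementation, which is not shown in this preview; it only assumes the conventional 8 × 8 zigzag order, a simplified run-level representation that treats every coefficient alike, and an exclusive prefix sum (scan) over code lengths as one plausible way to let independent threads write their codes at known bit positions. The function names and the code lengths in the usage note are hypothetical.

```python
from itertools import accumulate

N = 8  # MPEG transforms and quantizes 8x8 blocks


def zigzag_order(n=N):
    """Return the (row, col) cells of an n x n block in zigzag scan order."""
    # Cells on the same anti-diagonal share row + col. The traversal
    # direction alternates from one diagonal to the next.
    return sorted(
        ((r, c) for r in range(n) for c in range(n)),
        key=lambda rc: (rc[0] + rc[1],
                        rc[0] if (rc[0] + rc[1]) % 2 else rc[1]),
    )


def run_level_pairs(block):
    """Scan a quantized block in zigzag order and emit (run, level) pairs.

    `run` counts the zeros that precede each nonzero `level`. Trailing
    zeros are dropped, as an end-of-block code would imply them.
    """
    pairs, run = [], 0
    for r, c in zigzag_order(len(block)):
        level = block[r][c]
        if level == 0:
            run += 1
        else:
            pairs.append((run, level))
            run = 0
    return pairs


def bit_offsets(code_lengths):
    """Exclusive prefix sum: the start bit of each code in a packed stream.

    On a GPU this sum would be computed with a parallel scan; here it is
    sequential and only shows what the scan produces.
    """
    return list(accumulate(code_lengths, initial=0))[:-1]
```

For a block whose only nonzero coefficients are 12 at (0, 0), -3 at (0, 1) and 1 at (2, 0), `run_level_pairs` yields `[(0, 12), (0, -3), (1, 1)]`. If those three pairs mapped to codes of 6, 4 and 5 bits (made-up lengths, not taken from the MPEG tables), `bit_offsets([6, 4, 5])` would place them at bits 0, 6 and 10 of the output.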

Author information

Corresponding author

Correspondence to Javier Taibo.

Additional information

This work was partly supported by Xunta de Galicia grants PGIDIT 07TIC005105PR and 09TIC015CT and Spanish Ministry of Science and Innovation grant TIN2010-20959.

About this article

Cite this article

Montero, P., Gulías, V.M., Taibo, J. et al. Optimising lossless stages in a GPU-based MPEG encoder. Multimed Tools Appl 65, 495–520 (2013). https://doi.org/10.1007/s11042-012-1053-9

  • DOI: https://doi.org/10.1007/s11042-012-1053-9
