Abstract
H.264/AVC video encoders are widely used for their high coding efficiency. Because the computational demand grows with frame resolution, which keeps increasing, there has been great interest in accelerating H.264/AVC encoding through parallel processing. Recently, graphics processing units (GPUs) have emerged as a viable target for accelerating general-purpose applications by exploiting fine-grained data parallelism. Despite extensive research on GPU acceleration of H.264/AVC encoding, previous efforts have not achieved any speed-up over x264, which is known as the fastest CPU implementation, mainly because of the significant communication overhead between the host CPU and the GPU and the intra-frame dependencies in the algorithm. In this paper, we propose a novel motion-estimation (ME) algorithm tailored for implementation on NVIDIA GPUs. It is accompanied by a novel pipelining technique, called sub-frame ME processing, that effectively hides the communication overhead between the host CPU and the GPU. Further, we incorporate a frame-level parallelization technique to improve the overall throughput. Experimental results show that our proposed H.264 encoder achieves higher performance than the x264 encoder.
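To illustrate why splitting a frame into sub-frames can hide CPU-GPU communication overhead, the following is a minimal timing sketch, not the scheduling scheme used in the paper. It assumes three stages per frame (host-to-device copy, ME kernel, device-to-host copy) that can run concurrently on different sub-frames, and that each stage's cost divides evenly across sub-frames. The function name `pipelined_time` and all timing values are hypothetical and chosen only for illustration.

```python
def pipelined_time(stage_ms, n_subframes):
    """Completion time (ms) for one frame split into n_subframes equal chunks.

    stage_ms: per-frame cost of each stage, in order, e.g.
              (host-to-device copy, ME kernel, device-to-host copy).
    Assumptions (illustrative only): each stage handles one chunk at a
    time, different stages may overlap, and per-chunk launch or transfer
    setup overhead is ignored.
    """
    per_chunk = [t / n_subframes for t in stage_ms]
    stage_free = [0.0] * len(stage_ms)  # time at which each stage is idle again

    for _ in range(n_subframes):
        t = 0.0  # time at which this chunk is ready for its next stage
        for s, cost in enumerate(per_chunk):
            start = max(t, stage_free[s])  # wait for data and for the stage
            t = start + cost
            stage_free[s] = t
    return stage_free[-1]


if __name__ == "__main__":
    # Hypothetical per-frame costs in milliseconds.
    stages = (4.0, 8.0, 2.0)
    for n in (1, 2, 4, 8):
        print(n, pipelined_time(stages, n))
```

With a single sub-frame the three stages run back to back, so the frame time is the sum of the stage costs. As the number of sub-frames grows, the modeled frame time approaches the cost of the slowest stage, which is the sense in which the transfers are hidden behind computation. A real implementation would also pay a fixed cost per transfer and per kernel launch, so the benefit would level off rather than improve indefinitely.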
Acknowledgments
This research was supported by the MKE (The Ministry of Knowledge Economy), Korea, under the ITRC (Information Technology Research Center) support program supervised by the NIPA (National IT Industry Promotion Agency) (NIPA-2012-H0301-12-1011). Also, this work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) (No. 2011-0013479).
Cite this article
Ko, Y., Yi, Y. & Ha, S. An efficient parallelization technique for x264 encoder on heterogeneous platforms consisting of CPUs and GPUs. J Real-Time Image Proc 9, 5–18 (2014). https://doi.org/10.1007/s11554-012-0317-y
DOI: https://doi.org/10.1007/s11554-012-0317-y