ABSTRACT
In this paper, we present our experience designing, implementing, and evaluating efficient applications of the wavefront pattern to block-level motion estimation in video encoding algorithms, using OpenCL™ kernels on Intel® Processor Graphics™. We implement multiple solutions that explore different performance considerations, evaluate their trade-offs, present performance data, and provide our recommendations.
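As an illustration of the wavefront pattern named above, the following Python sketch groups the blocks of a frame into waves whose members can be processed in parallel. It is a minimal sketch under stated assumptions, not the OpenCL implementation evaluated in the paper: it assumes each block depends on its left, top-left, top, and top-right neighbours (a dependency set common in H.264-style motion vector prediction, which the abstract does not spell out), and the function name `wavefront_schedule` and the 4x3 grid are hypothetical.

```python
from collections import defaultdict


def wavefront_schedule(width_blocks, height_blocks):
    """Group block coordinates (x, y) into waves of mutually independent blocks.

    Assumption (not taken from the paper): a block depends on its left,
    top-left, top and top-right neighbours. Under that dependency set,
    every block with the same value of x + 2*y can run concurrently.
    """
    waves = defaultdict(list)
    for y in range(height_blocks):
        for x in range(width_blocks):
            waves[x + 2 * y].append((x, y))
    # Return the waves in execution order, earliest first.
    return [waves[index] for index in sorted(waves)]


# Hypothetical frame of 4 x 3 blocks.
schedule = wavefront_schedule(4, 3)
for step, blocks in enumerate(schedule):
    print(step, blocks)
```

Under this dependency assumption a frame of W x H blocks needs W + 2(H - 1) sequential steps, and the number of blocks available per step ramps up and down, which is one reason different mappings of waves to GPU work can perform differently.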
Wavefront Parallel Processing on GPUs with an Application to Video Encoding Algorithms