Synchronous and asynchronous HEVC parallel encoder versions based on a GOP approach
Introduction
Recently, the Joint Collaborative Team on Video Coding (JCT-VC), co-established by ISO/IEC MPEG (Moving Picture Experts Group) and ITU-T VCEG (Video Coding Experts Group), standardized the next-generation video coding technology, called High Efficiency Video Coding (HEVC) [1]. The new standard is intended to replace the current H.264/AVC (Advanced Video Coding) [2] standard in order to cope with current and future multimedia market trends: 4K video content is already a fact, and 8K video will not take long to become a reality. Moreover, the new standard supports color depths of 8 and 10 bits. HEVC improves coding efficiency over its predecessor (H.264/AVC) by a factor of almost two while maintaining equivalent visual quality [3].
Regarding complexity, Bossen et al. [4] studied the complexity aspects of HEVC encoding and decoding software. Their study concludes that the encoding process is much more demanding than the decoding process; e.g., encoding one second of a 1080p60 HD (High Definition) video with the reference software encoder can take longer than one hour on an off-the-shelf desktop computer. Therefore, HEVC encoder optimization will be a hot research topic in the years to come.
Several works on complexity analysis and parallelization strategies for the emerging HEVC standard can be found in the literature [4], [5], [6]. Most parallelization proposals focus on the decoding side, looking for the parallel optimizations that best provide real-time decoding of High-Definition (HD) and Ultra-High-Definition (UHD) video content. In [7], [8] the authors present a variation of Wavefront Parallel Processing (WPP), called Overlapped Wavefront (OWF), for the HEVC decoder, in which the execution over consecutive pictures is overlapped. In a multi-threaded HEVC decoder, a picture is decoded by several threads at the same time, each thread being in charge of decoding different Coding Tree Block (CTB) rows. In these works, the authors claim that a thread may continue with the next picture as soon as it finishes the current one, without waiting for the other threads. These variations allow better parallel processing efficiency, reducing the overall decoding time. Recently, in [9] the authors combined tiles, WPP and SIMD (Single-Instruction Multiple-Data instruction set extensions to the x86 architecture) instructions to develop a real-time HEVC decoder.
At the moment, only a few works focused on the HEVC encoder have been reported. In [10] the authors propose a fine-grain parallel optimization of the HEVC motion estimation module that performs the motion prediction of all Prediction Units (PUs) available in one Coding Unit (CU) at the same time. In [11], [12] the authors propose a real-time motion estimation block, focusing on the optimization of motion estimation algorithms on an FPGA-based low-cost embedded system that combines synchronous dynamic random access memory (SDRAM) with the on-chip memory of software-based Nios II processors. By optimizing memory access on this platform, time savings of up to 53% were achieved in the motion estimation module. In [13] the authors propose a parallelization of the intra prediction module that consists of removing data dependencies among the subblocks of a CU, obtaining interesting speed-up results with a negligible loss in coding performance. Other recent works focus on changes in the scanning order. For example, in [14] the authors propose a CU scanning order based on a diamond search, obtaining a good scheme for massively parallel processing. Also, in [15] the authors propose changing the processing order of the HEVC deblocking filter, obtaining time savings of 37.93% on many-core architectures.
In this paper, we focus on applying parallel processing techniques to the HEVC encoder in order to significantly reduce its computational requirements without harming coding efficiency. Instead of optimizing one specific module of the HEVC encoder, as other proposals do, our proposals use the OpenMP programming paradigm and work at a coarse-grain parallelization level, which we call the GOP-based level. GOP-based approaches encode several Groups Of Pictures (GOPs) simultaneously. How these GOPs are formed and distributed is critical to obtaining good parallel performance, also taking into account the resulting degradation in coding efficiency. This paper builds on Migallón et al. [16], adding more results and further research: (a) new asynchronous parallel versions, and (b) a comparison between synchronous and asynchronous parallel versions in terms of computational complexity (coding time) and speed-up.
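As a minimal sketch of what forming and distributing GOPs can look like, the Python snippet below splits a frame sequence into fixed-size GOPs and assigns them to processes in round-robin order. The function names, the fixed GOP size and the round-robin rule are illustrative assumptions, not the allocation schemes evaluated in the paper.

```python
def split_into_gops(num_frames, gop_size):
    """Return one (first_frame, last_frame) pair per GOP (hypothetical helper)."""
    if num_frames < 0 or gop_size <= 0:
        raise ValueError("num_frames must be >= 0 and gop_size must be > 0")
    # The last GOP may be shorter when num_frames is not a multiple of gop_size.
    return [(start, min(start + gop_size, num_frames) - 1)
            for start in range(0, num_frames, gop_size)]


def assign_round_robin(num_gops, num_procs):
    """Assumed static distribution: GOP g is encoded by process g % num_procs."""
    return [list(range(p, num_gops, num_procs)) for p in range(num_procs)]


gops = split_into_gops(num_frames=20, gop_size=8)
plan = assign_round_robin(len(gops), num_procs=2)
print(gops)  # frame ranges of each GOP
print(plan)  # GOP indices handled by each process
```

With 20 frames and a GOP size of 8, the last GOP holds only four frames, which is one simple way a distribution can become unbalanced.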
The remainder of this paper is organized as follows. Section 2 presents an overview of the available profiles and parallel strategies in HEVC. Sections 3 and 4 describe the GOP-based parallel alternatives proposed for synchronous and asynchronous architectures, respectively, while Section 5 compares the proposed parallel approaches. Finally, Section 6 draws some conclusions.
Section snippets
HEVC coding modes and parallel strategies
HEVC follows a hybrid video coding scheme consisting of a sequence of three main steps. First, spatial or temporal redundancy is exploited to predict a frame region, so that only the prediction residual and some side information need to be encoded. In the second step, the residual is transformed into the frequency domain and the resulting coefficients are quantized (lossy compression). Depending on the quantization step, we will achieve a higher or lower
Synchronous GOP-based parallel algorithms
We have developed several strategies, described below; in all of them, each GOP is computed by a single process. First, we propose synchronous algorithms, in which the synchronization points are located after the encoding of a GOP, and the reference pictures are not shared. The main goals of these mild restrictions are, on the one hand, to fill the bit stream as the information becomes available and, on the other hand, to be able to extend the work to distributed memory platforms without
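The synchronous scheme described above can be illustrated with a short Python sketch: in each round, every worker encodes one GOP, all workers meet at a synchronization point, and the finished chunks are appended to the bit stream in GOP order. The names `encode_gop` and `encode_synchronous` are hypothetical stand-ins, and Python threads are used here only to show the control flow (the paper itself relies on OpenMP).

```python
from concurrent.futures import ThreadPoolExecutor


def encode_gop(gop_index):
    """Hypothetical stand-in for encoding one GOP; returns a labeled chunk."""
    return f"GOP{gop_index}"


def encode_synchronous(num_gops, num_procs, encode=encode_gop):
    """Encode GOPs in rounds of num_procs, synchronizing after each round."""
    bitstream = []
    with ThreadPoolExecutor(max_workers=num_procs) as pool:
        for first in range(0, num_gops, num_procs):
            batch = range(first, min(first + num_procs, num_gops))
            # list(pool.map(...)) returns only when every GOP in the batch
            # has been encoded: this is the synchronization point.
            chunks = list(pool.map(encode, batch))
            # Results come back in GOP order, so the bit stream can be
            # filled as soon as the round is complete.
            bitstream.extend(chunks)
    return bitstream


print(encode_synchronous(num_gops=5, num_procs=2))
```

Because each GOP is encoded independently and no reference pictures are shared, the same structure could in principle be mapped onto separate processes or nodes.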
Asynchronous GOP-based parallel algorithms
The synchronization processes in the previous algorithms imply a fixed and structured Group of Pictures (GOP) computation, i.e., the number and the order of the GOPs computed by each process are known (see Fig. 6, Fig. 8, Fig. 10, Fig. 12 and Fig. 14). In this section, we propose asynchronous algorithms based on the previously presented synchronous algorithms, in which the next GOP to be computed by a process depends on the encoding state at the time it finishes computing its current GOP. The
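One common way to realize this kind of dynamic assignment is a shared counter protected by a lock: whenever a worker finishes its current GOP, it takes the next GOP that has not yet been started. The Python sketch below assumes that mechanism; it is not necessarily the paper's exact rule, and `encode_gop` and `encode_asynchronous` are hypothetical names.

```python
import threading


def encode_gop(gop_index):
    """Hypothetical stand-in for encoding one GOP; returns a labeled chunk."""
    return f"GOP{gop_index}"


def encode_asynchronous(num_gops, num_procs, encode=encode_gop):
    """Each worker grabs the next unstarted GOP as soon as it becomes free."""
    chunks = [None] * num_gops   # encoded GOPs, stored by GOP index
    owner = [None] * num_gops    # which worker encoded each GOP
    next_gop = 0                 # shared encoding state
    lock = threading.Lock()

    def worker(worker_id):
        nonlocal next_gop
        while True:
            with lock:           # only the counter update is serialized
                if next_gop >= num_gops:
                    return
                gop = next_gop
                next_gop += 1
            chunks[gop] = encode(gop)
            owner[gop] = worker_id

    threads = [threading.Thread(target=worker, args=(w,))
               for w in range(num_procs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return chunks, owner


chunks, owner = encode_asynchronous(num_gops=7, num_procs=3)
print(chunks)
print(owner)  # the GOP-to-worker mapping may differ from run to run
```

Unlike the synchronous case, the number and order of GOPs handled by each worker are not known in advance, so the chunks are stored by GOP index and assembled afterwards.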
Comparison results between synchronous and asynchronous algorithms
Finally, in this section we compare the synchronous algorithms explained in Section 3 with the asynchronous algorithms presented in Section 4.
Fig. 32 shows a comparison, in terms of relative computational time savings, between the synchronous and asynchronous algorithms. As can be seen, the asynchronous algorithms obtain a slight time reduction, which grows with the number of processes. Since our parallel test platform is a homogeneous computing platform and the
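For readers who want to reproduce this kind of comparison, the two metrics can be computed as in the Python sketch below. The definitions are the usual ones (speed-up as sequential time over parallel time, relative saving as a percentage of the synchronous time) and are assumed here, since this excerpt does not state the exact formulas; the timings in the example are made-up numbers, not measurements from the paper.

```python
def speed_up(t_sequential, t_parallel):
    """Assumed definition: sequential coding time divided by parallel time."""
    return t_sequential / t_parallel


def relative_time_saving(t_sync, t_async):
    """Assumed definition: percentage of the synchronous time that is saved."""
    return 100.0 * (t_sync - t_async) / t_sync


# Illustrative timings in seconds (hypothetical, for demonstration only).
print(speed_up(t_sequential=100.0, t_parallel=25.0))
print(relative_time_saving(t_sync=50.0, t_async=45.0))
```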
Conclusions
In this paper, we have proposed several parallel algorithms for the HEVC video encoder. These algorithms are based on a coarse-grain parallelization approach in which video frames are organized into Groups Of Pictures (GOPs) and different GOP allocation schemes are applied. The reported experiments, obtained on a multicore platform, show good parallel behavior. However, the developed algorithms are able to run on distributed memory architectures since a coarser grain parallelization
Acknowledgments
This research was supported by the Spanish Ministry of Education and Science under Grant TIN2011-27543-C03-03, and by the Spanish Ministry of Science and Innovation under Grants TIN2015-66972-C5-4-R and TIN2011-15734-E.
References (19)
- Bross B., Han W., Ohm J., Sullivan G., Wang Y.-K., Wiegand T.. High efficiency video coding (HEVC) text specification...
- ITU-T, ISO/IEC JTC 1, Advanced video coding for generic audiovisual services, ITU-T Rec. H.264 and ISO/IEC 14496-10...
- et al. Overview of the high efficiency video coding (HEVC) standard. IEEE Trans Circuits Syst Video Technol (2012)
- et al. HEVC complexity and implementation analysis. IEEE Trans Circuits Syst Video Technol (2012)
- et al. Parallel video decoding in the emerging HEVC standard. Proceedings of international conference on acoustics, speech, and signal processing, Kyoto (2012)
- et al. Review of proposed high efficiency video coding (HEVC) standard. Int J Comput Appl (2012)
- et al. Parallel scalability and efficiency of HEVC parallelization approaches. IEEE Trans Circuits Syst Video Technol (2012)
- et al. Parallel HEVC decoding on multi- and many-core architectures. J Signal Process Syst (2013)
- et al. High efficiency video coding (HEVC) text specification draft 10. Technical report JCTVC-L1003 (January 2013)